Abstract

The asymptotic properties of extensions of the type of distributed or
decentralized stochastic approximation proposed in [1] are developed. Such
algorithms have numerous potential applications in decentralized estimation,
detection and adaptive control, or in decentralized Monte Carlo simulation
for system optimization (where they can exploit the possibilities of parallel
processing). The structure involves several isolated processors (recursive
algorithms) that communicate with each other asynchronously and at random
intervals. The asymptotic (small gain) properties are derived. The
communication intervals need not be strictly bounded, and they and the
system noise can depend on the (communicating) system state. State space
constraints are also handled. In many applications, the dynamical terms are
merely indicator functions, or have other types of discontinuities. The
‘typical’ such case is also treated, as is the case where there is noise in the
communication. The linear stochastic differential equation satisfied by the
(interpolated) asymptotic normalized error sequence is derived, and is used to
compare alternative algorithms and communication strategies. Weak
convergence methods provide the basic tools.

1. Introduction

In [1], [2], Tsitsiklis proposed a very interesting model for a
decentralized (distributed) recursive algorithm of the stochastic
approximation (SA) type, with only asynchronous communications between
the separate processors, and developed a scheme for proving w.p.1
convergence. That work appears to be the first of its type for the
decentralized SA problem. Such distributed algorithms are of rapidly
growing interest. Various potential applications in adaptive control,
estimation and communication networks were proposed; e.g., several
processors might identify the parameters of an identical
linear system (but with different inputs) and occasionally (asynchronously)
share their latest estimates, or several processors might do Monte Carlo
simulations of the SA type to locate the minimum of a regression function,
and occasionally share their estimates. There are two main purposes for
algorithms of the type discussed here and in [1]: to exploit the
opportunities provided by parallel processing for Monte Carlo methods of
system optimization or evaluation; and to handle situations in which there are
physically separate systems (estimators, trackers, controllers) which act on or
follow essentially the same physical system, and which occasionally communicate
to take advantage of the others' information.

The assumptions in [1] were fairly strong with respect to the great
variety of potential applications, and the method of analysis required
numerous detailed estimates. We analyze essentially the same algorithm here.
In addition to getting the basic convergence results, our methods can handle
the constrained (projected) algorithm, the case where the communication
intervals and noise depend on the state, the general rate of convergence
problem, and the case where there is communication noise. Instead of
letting the ‘gain’ parameter go to zero as $n \to \infty$ (as frequently done in
classical SA) we keep it constant, and work with convergence in the sense
of weak convergence. There are several reasons for this. First, when
working with practical systems the chosen gains almost never go to zero,
since one usually wants an algorithm that can track slow changes and is
robust with respect to large bursts of noise. Our method can be adapted to
get weak and even w.p.1 convergence when the gains do go to zero, and we
comment on this in Section 8. Even if the gains do go to zero, w.p.1
convergence is not much more useful or interesting than weak convergence.
Weak convergence methods locate the points where the process spends most
of its time (asymptotically), and as time goes to $\infty$, an increasing (to one)
proportion of time is spent arbitrarily close to such points. Then, one can
often use the powerful ‘large deviations’ methods to show, under very broad
conditions, that ultimate escape from a small neighborhood of such points is
impossible (when the gains go to zero) [3], [4]. Alternatively, once the weak
convergence methods have located the ‘stable points’, perturbed Liapunov
methods such as that in [10] can often be used to get w.p.1 convergence.
One of the key questions in the analysis of any algorithm is the rate of
convergence (the asymptotic normalized variance), and the analysis of the
‘rate’ is almost always done via weak convergence methods. General
background and applications in many areas are in [6] to [8]. Weak
convergence methods are also much easier to use than the standard w.p.1
oriented methods; in many cases, a valid result can be obtained almost by
inspection. This and the wide variety of problems which can be handled
make it a more widely useful tool than ‘w.p.1’ methods. The symbol $\Rightarrow$ is
used to denote weak convergence, and some definitions and properties of
this convergence are stated in Appendix 1.

The  methods  used  here  are  quite  efficient.  Problems  with  potentially 
unbounded  intercommunication  intervals  (e.g.,  where  the  interval  is 
geometrically  distributed)  can  be  handled.  We  can  also  treat  important  cases 
where  the  dynamics  are  discontinuous  or  where  the  communication  intervals 
and  system  noise  depend  on  the  system  state,  or  where  there  are  state  space 
constraints.  The  case  of  discontinuous  dynamics  is  of  considerable 
importance in applications: often an estimate increases or decreases by a
fixed amount $\epsilon$, depending simply on whether a certain event occurred or
not. The same holds for state dependent communication times: a processor might
want  to  communicate  if  either  a  given  amount  of  time  has  passed  since  the 
last  communication  or  if  the  state  of  the  processor  has  changed  by  more  than 
a  given  amount.  In  many  applications  (e.g.,  the  decentralized  form  of  the 
automata  routing  problem  in  [5])  the  noise  is  naturally  state  dependent. 

A  theory  of  ‘rate  of  convergence’  is  also  developed,  which  allows  an 
objective  comparison  among  alternative  algorithms.  Using  this,  in  Sections  6 
and  7,  we  comment  on  and  compare  the  behavior  of  the  algorithm  with  the 
centralized and various ‘deterministically’ decentralized forms, in order to
get a better understanding of its behavior, and to see which communication
strategies are preferable. We can also allow ‘noise’ in the communication,
such  as  might  be  the  case  if  the  processors  were  physically  separated  and 
communicated  via  a  noisy  radio  link.  See  Section  7. 

The basic algorithm will be described next. Section 2 contains a
‘technical’ estimate which will be useful in the sequel. Section 3 deals with
the basic weak convergence result in the function spaces $D[0,\infty)$ or $C[0,\infty)$
(see the Appendix for the definitions), and shows that a suitable continuous time
interpolation $X^\epsilon(\cdot)$ of the iterates $\{X_n\}$ converges weakly to the solution of a
certain ODE as the gain parameter $\epsilon \to 0$. The state dependent
noise/intercommunication time case and the discontinuous dynamics case are also
treated there. Section 4 concerns a ‘projection’ algorithm to handle state
space constraints. Here, the limit satisfies a ‘projected’ ODE. The
asymptotics of $X^\epsilon(t_\epsilon + \cdot)$ are dealt with in Section 5, where $t_\epsilon \to \infty$ as $\epsilon \to 0$.
This yields the ultimately desired result concerning the location of the
iterates for large n and small $\epsilon$. Finally, the rate of convergence and
comparison with a centralized processor are developed in Sections 6 and 7. A
discussion of some of the probable advantages and uses of the algorithm
appears in Section 7. Section 8 contains a comment on the case where $\epsilon$ is
replaced by $\epsilon_n \to 0$.

The basic algorithm. We assume that there are q parallel processors,
each with a state variable of dimension r. Let $X_n^i$ denote the state of
processor i at time n and define $X_n = (X_n^1, \ldots, X_n^q)$. The symbol X generally
denotes a qr-vector which we partition as $X = (X^1, \ldots, X^q)$, where each $X^i$ is
an r-vector. The ‘observation’ of processor i at time n is $b^i(X_n^i, \xi_n^i)$, where $\xi_n^i$
is the ‘noise’. Write $\xi_n = (\xi_n^1, \ldots, \xi_n^q)$, $\xi = (\xi^1, \ldots, \xi^q)$, and
$B(X,\xi) = (b^1(X^1,\xi^1), \ldots, b^q(X^q,\xi^q))$. Write
$b^i(X^i,\xi^i) = (b_1^i(X^i,\xi^i), \ldots, b_r^i(X^i,\xi^i))$, the $b_j^i(\cdot)$ being scalar
valued. (All the above vectors are column vectors.) For vectors $X^i$ in $E^r$, we
often write simply x.

Let $\{A_n\}$ be a sequence of (possibly random) $qr \times qr$ matrices, where $A_n$
can be written in the form

$$A_n = \begin{bmatrix} a_{11}(n) & \cdots & a_{q1}(n) \\ \vdots & & \vdots \\ a_{1q}(n) & \cdots & a_{qq}(n) \end{bmatrix},$$

where each $a_{ij}(n)$ is a diagonal $r \times r$ matrix with non-negative entries and
$\sum_j a_{ji}(n) = I_r$, the identity matrix in $E^r$, Euclidean r-space (i.e., the ‘matrix
valued’ rows of $A_n$ are ‘convexifying’). Suppose that there is a scalar $\alpha_0 > 0$
such that $a_{ii}(n) \ge \alpha_0 I_r$ and, for $i \ne j$, either $a_{ij}(n) = 0$ or else $a_{ij}(n) \ge \alpha_0 I_r$.

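The algorithm is

$$X_{n+1}^i = \sum_j a_{ji}(n) X_n^j + \epsilon\, b^i(X_n^i, \xi_n^i), \qquad i = 1, \ldots, q,$$

or, in vector form,

$$X_{n+1} = A_n X_n + \epsilon B(X_n, \xi_n). \tag{1.1}$$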

At time n, processor i (i = 1, ..., q) decides whether or not to communicate the
current value of its state to any other processor and takes an observation
$b^i(X_n^i, \xi_n^i)$. If there is no communication to processor i, then we set $a_{ii}(n) = I_r$
and $a_{ji}(n) = 0$ for $j \ne i$, and the iteration (for processor i at time n) is of the
standard SA type: $X_{n+1}^i = X_n^i + \epsilon b^i(X_n^i, \xi_n^i)$. If there are any communications
to processor i from some processors $j \ne i$ at time n, then for such
communicating processors j, $a_{ji}(n) \ge \alpha_0 I_r$ and the updated state $X_{n+1}^i$ for
processor i is a convex combination of $X_n^i$ and of the states $X_n^j$
communicated to it, added to its own SA increment $\epsilon b^i(X_n^i, \xi_n^i)$. The
requirement that either (for $j \ne i$) $a_{ji}(n) \ge \alpha_0 I_r$ or $a_{ji}(n) = 0$ simply means that
if processor j communicates to processor i at time n, processor i can choose
to ignore the communication, but if it incorporates the received value into its
own state, it must do so in a ‘non-trivial’ way. For notational simplicity, we
omit the symbol for the $\epsilon$-dependence of $X_n$.
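
The iteration (1.1) can be sketched numerically for scalar states (r = 1). The following is only an illustration under assumptions that are not part of the paper: each processor j communicates to each processor i independently with probability `comm_prob` (a coin-flip model), a received state is always mixed in with the fixed weight `alpha`, and the observation is the hypothetical $b^i(x,\xi) = (\text{target} - x) + \xi$ with standard Gaussian noise.

```python
import random


def sa_step(states, eps, comm_prob, alpha, target, rng):
    """One step of X_{n+1} = A_n X_n + eps * B(X_n, xi_n) with scalar states (r = 1)."""
    q = len(states)
    new_states = []
    for i in range(q):
        # Row i of A_n: weight 1 on the own state unless something is received.
        weights = [0.0] * q
        weights[i] = 1.0
        for j in range(q):
            if j != i and rng.random() < comm_prob:
                # Processor j communicates to i, which mixes it in with weight alpha.
                weights[j] = alpha
                weights[i] -= alpha
        mixed = sum(w * x for w, x in zip(weights, states))
        # Hypothetical observation b^i(x, xi) = (target - x) + xi, xi ~ N(0, 1).
        observation = (target - states[i]) + rng.gauss(0.0, 1.0)
        new_states.append(mixed + eps * observation)
    return new_states


def run_sa(initial, eps, n_steps, comm_prob=0.2, alpha=0.25, target=2.0, seed=0):
    """Iterate sa_step n_steps times and return the final states."""
    # The own weight is at least 1 - alpha * (q - 1); it must stay positive.
    assert alpha * (len(initial) - 1) < 1.0
    rng = random.Random(seed)
    states = list(initial)
    for _ in range(n_steps):
        states = sa_step(states, eps, comm_prob, alpha, target, rng)
    return states
```

For example, `run_sa([0.0, 5.0, -3.0], eps=0.01, n_steps=5000)` returns three states that lie close to each other and close to the assumed target. The fixed mixing weight keeps every row of the simulated $A_n$ convexifying with own weight bounded away from zero, as required above.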

In [1], the algorithm was slightly more complex, since the dimensions of
the $X^i$ were not necessarily the same and a somewhat more complicated block
structure of $A_n$ was used. But, with no additional mathematical work
(although with a more complex notation), such extensions can readily be
incorporated into our framework. It should be clear from the development
that many related algorithms and conditions can be treated by essentially
identical methods.


2.  Some  Preparatory  Estimates 


This section is devoted to obtaining the rate of convergence of the
product $A_n \cdots A_k$ as $n \to \infty$. We use the assumption

C2.1. Let $\{F_n\}$ be an increasing sequence of $\sigma$-algebras such that $F_n$ measures
$\{X_i, i \le n;\ \xi_i, A_i, i < n\}$. There are a scalar $p_0 > 0$ and an integer $m_0$ such that

$$P_{F_n}\{\text{processor } i \text{ communicates to processor } j \text{ on } [n, n+m_0)\} \ge p_0 \tag{2.1}$$

for all n and all i, j with $i \ne j$.

Remark. In [1], it was assumed that there is an $m_0$ such that $p_0 = 1$.
(C2.1) covers the case where at each instant each processor flips a coin to
decide whether to communicate or not. More generally, there often is a
process $\{\tilde A_n\}$ such that $\{\tilde A_i, \xi_i, i < n;\ X_i, i \le n\}$ is Markov, and $A_n$ is a
component of $\tilde A_n$. With this model, if $F_n$ denotes the minimal $\sigma$-algebra
which measures $\{\tilde A_i, \xi_i, i < n;\ X_i, i \le n\}$, then (C2.1) covers many interesting
cases where the inter-communication intervals are not bounded a priori
and might be ‘state’ dependent. The condition seems to be unrestrictive.

For $n \ge k$, define $\Phi(n|k) = A_n \cdots A_k$ and set $\Phi(n|n+1) = I_{qr}$, the identity
matrix in $E^{qr}$.

Lemma 2.1. Assume (C2.1) and the conditions on $\{A_n\}$ in Section 1.
Then $\Phi_k = \lim_n \Phi(n|k)$ exists w.p.1 and, for each $i \le r$, all the rows
$i, i+r, \ldots, i+qr-r$ of $\Phi_k$ are equal. Also

$$E|\Phi(n|k) - \Phi_k| \to 0 \text{ geometrically as } n - k \to \infty, \tag{2.1a}$$

$$E_{F_k}|\Phi(n|k) - \Phi_k| \to 0 \text{ geometrically as } n - k \to \infty. \tag{2.1b}$$

Remark. The fact that the limit $\Phi_k$ exists is almost obvious if we look at
the $\{A_n\}$ as transition matrices for a Markov chain.
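
The content of Lemma 2.1 can be observed numerically. The sketch below takes r = 1 and an assumed coin-flip communication model (each ordered pair communicates independently with probability `comm_prob`, and a received state gets the fixed weight `alpha`); these choices and all function names are illustrative only. It forms the product $\Phi(n|1) = A_n \cdots A_1$ and records how far the rows are from being equal.

```python
import random


def comm_matrix(q, comm_prob, alpha, rng):
    """Random row-stochastic A_n: row i mixes in each processor j w.p. comm_prob."""
    rows = []
    for i in range(q):
        row = [0.0] * q
        row[i] = 1.0
        for j in range(q):
            if j != i and rng.random() < comm_prob:
                row[j] = alpha
                row[i] -= alpha
        rows.append(row)
    return rows


def matmul(a, b):
    """Plain-Python product of two square matrices given as lists of rows."""
    q = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(q)) for j in range(q)]
            for i in range(q)]


def row_disagreement(m):
    """Largest gap, over all columns, between entries of different rows."""
    return max(max(col) - min(col) for col in zip(*m))


def product_history(q=4, n=200, comm_prob=0.3, alpha=0.25, seed=1):
    """Return Phi(n|1) = A_n ... A_1 and the row disagreement after each factor."""
    rng = random.Random(seed)
    phi = [[1.0 if i == j else 0.0 for j in range(q)] for i in range(q)]
    gaps = []
    for _ in range(n):
        # The newest factor multiplies on the left, as in Phi(n|k) = A_n ... A_k.
        phi = matmul(comm_matrix(q, comm_prob, alpha, rng), phi)
        gaps.append(row_disagreement(phi))
    return phi, gaps
```

Calling `product_history()` gives a matrix whose rows are numerically identical probability vectors, and the recorded gaps never increase, which is the Markov chain picture mentioned in the remark.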

Proof. The proofs of (2.1a) and (2.1b) are essentially the same and only
(2.1a) will be proved. We will evaluate $E|\Phi(n|k) - \Phi_k|$ by a slight variation of
the proof of [1, Lemma 5.2.1]. Owing to the diagonal structure of the blocks
$a_{ij}(n)$ of the $A_n$, in calculating the product $\Phi(n|k)$, the r sets of rows
$(i, i+r, \ldots, i+qr-r)$, $i \le r$, do not interact and we can (and will) let r = 1
without loss of generality.

The geometric convergence of $\Phi(n|k)$ to $\Phi_k$ was proved in [1, Lemma 5.2.1]
when $p_0 = 1$ (see the remark below (2.4)). By (C2.1), there are $\alpha_1 > 0$ and an
increasing sequence of random times $\{N_i\}$ such that the components of
$\Phi(N_{2i+1}|N_{2i})$ are all $\ge \alpha_1$. This and the convergence result for $p_0 = 1$ imply
that $\Phi(n|k)$ converges w.p.1 to some matrix $\Phi_k$, as $n \to \infty$. All the rows of such
a limit must be equal, and the entries of each row must sum to unity. Let
$\phi_k(1), \ldots, \phi_k(q)$ denote the scalar elements of any row of $\Phi_k$, let the
vectors $v_1, \ldots, v_q$ span $E^q$, and define $e = (1, 1, \ldots, 1)$. Define $c(x) = \sum_i \phi_k(i) x_i$,
where $x = (x_1, \ldots, x_q)$. Both e and x are column vectors. Then $\Phi_k x = c(x)e$.
All  norms  here  and  elsewhere  are  in  the  sense. 

For a matrix M,

$$|M| = \sup_{|x|=1} |Mx| \le \sum_i |M v_i|.$$

Thus, we need only show, for any vector x, that $E|\Phi(n|k)x - c(x)e| \to 0$
geometrically. Define $x(n|k) = \Phi(n|k)x$. Let $c(n|k)$ denote the minimum value
of the components of $x(n|k)$. We can write $x(n|k) = y(n|k) + c(n|k)e$, where all
the components of $y(n|k)$ are non-negative and

$$E|\Phi(n|k)x - c(x)e| \le E|y(n|k)| + E|c(x) - c(n|k)|. \tag{2.2}$$

By the ‘convexification’ properties of the $A_n$,

$$|y(n+1|k)| \le |y(n|k)|, \tag{2.3a}$$

$$c(n|k) \le c(n+1|k) \le c(n|k) + |y(n|k)|. \tag{2.3b}$$

By (C2.1), there is an $\alpha_0 > 0$ such that w.p. $p_0$ (conditioned on $F_n$) all the
elements of $\Phi(n+m_0|n)$ are $\ge \alpha_0$. This, together with the ‘convexification’
property of the $A_n$, implies that

$$E|y(n+m_0|k)| \le (1 - \alpha_0 p_0) E|y(n|k)|. \tag{2.4}$$

(If $p_0 = 1$, then drop the E in (2.4), and (2.3), (2.4) yield w.p.1 convergence.)
The asserted geometric convergence is a consequence of (2.3), (2.4) and the
w.p.1 convergence of $\Phi(n|k)$ (hence of $c(n|k)$ to $c(x)$). The last sentence of the
theorem follows by a similar argument. Q.E.D.
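
The deterministic step behind (2.3) and (2.4) can be checked numerically: if a row-stochastic matrix M has all entries at least `floor`, then the spread (maximum minus minimum component) of Mx is at most (1 - floor) times the spread of x, and the minimum component does not decrease. The sketch below assumes r = 1 and measures $y = x - \min(x)e$ in the max norm; that choice of norm, the random test matrices and the function names are assumptions of this illustration.

```python
import random


def spread(x):
    """Max minus min of a vector, i.e. the max norm of y = x - min(x) * e."""
    return max(x) - min(x)


def positive_stochastic_matrix(q, floor, rng):
    """Random row-stochastic q x q matrix whose entries are all >= floor."""
    assert 0.0 < floor * q < 1.0
    rows = []
    for _ in range(q):
        raw = [rng.random() + 1e-9 for _ in range(q)]
        total = sum(raw)
        # Reserve `floor` for every entry and share the remaining mass at random.
        rows.append([floor + (1.0 - q * floor) * r / total for r in raw])
    return rows


def mat_vec(m, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in m]


def worst_contraction_ratio(q=5, floor=0.05, trials=1000, seed=2):
    """Largest observed value of spread(Mx) / ((1 - floor) * spread(x))."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        m = positive_stochastic_matrix(q, floor, rng)
        x = [rng.uniform(-10.0, 10.0) for _ in range(q)]
        mx = mat_vec(m, x)
        # The minimum component cannot decrease under a convex combination.
        assert min(mx) >= min(x) - 1e-12
        worst = max(worst, spread(mx) / ((1.0 - floor) * spread(x)))
    return worst
```

The returned ratio never exceeds one. In the proof, such a strictly positive matrix occurs with conditional probability at least $p_0$ over each block of $m_0$ steps, which is where the factor $(1 - \alpha_0 p_0)$ in (2.4) comes from.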

Remark  on  Other  Cases.  One  can  readily  work  with  the  case  where  all 
of  the  processors  do  not  necessarily  communicate  with  each  other.  We 
comment only on one special case. Let processors $1, \ldots, q_1$ communicate to
each other but not to the other processors, and let processors $q_1+1, \ldots, q$
communicate only to processors $1, \ldots, q_1$ but not to each other. Then $\Phi(n|k)$
converges geometrically to a matrix $\Phi_k$ which takes the form


3.  Convergence:  The  Limit  ODE 

Nonstate Dependent $\{A_n, \xi_n\}$. We will work with several sets of
assumptions. First, the basic convergence theorem will be proved when
the sequences $\{A_n\}$ and $\{\xi_n\}$ are non state-dependent and independent of
each other, and then the restrictions will be weakened. Let $E_k$ denote
expectation, conditioned on $\{X_i, i \le k;\ A_i, i < k\}$. We will use subsets
of the following assumptions. Theorem 3.1 is the basic weak convergence
theorem, from which most other results will follow. The conditions do
not seem to be restrictive.

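(C3.1) $\{A_k\}$ and $\{\xi_n\}$ are independent of each other.

(C3.2) $B(X,\xi) = B_0(X,\xi) + B_1(X)\psi$, where the $B_i(\cdot)$ are continuous ($B_0(\cdot,\xi)$,
uniformly in $\xi$), $\{\xi_k\}$ is a sequence of bounded random variables and $\{\psi_k\}$ is
a sequence with zero mean and bounded 4th moment.

(C3.3) There is a continuous function $\bar B(X) = (\bar b^1(X^1), \ldots, \bar b^q(X^q))$ such that

$$E_k B(X, \xi_n) \to \bar B(X), \qquad E_k \psi_n \to 0$$

in probability for each X, as $n - k \to \infty$.

(C3.4) There are a matrix $\bar\Phi$ and a sequence $m_\epsilon \to \infty$ such that $\epsilon m_\epsilon = \delta_\epsilon \to 0$ and

$$E\left| \frac{1}{m_\epsilon} \sum_{k=n}^{n+m_\epsilon-1} E_n \Phi_k - \bar\Phi \right| \to 0, \quad \text{uniformly in } n.$$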


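Remark and Definition. Under the conditions of Lemma 2.1, $\bar\Phi$ must
take the form

$$\bar\Phi = \begin{bmatrix} \bar\phi_1 & \cdots & \bar\phi_q \\ \vdots & & \vdots \\ \bar\phi_1 & \cdots & \bar\phi_q \end{bmatrix},$$

where the $\bar\phi_i$ are diagonal matrices with diagonal denoted by $(\bar\phi_{i1}, \ldots, \bar\phi_{ir})$
and $\sum_i \bar\phi_{ij} = 1$ for each j. For any vector X we have the form $\bar\Phi X = (y, \ldots, y)$ and
$\Phi_k X = (y_k, \ldots, y_k)$ for some y and $y_k$ in $E^r$. Let $\hat\Phi$ denote the row of $r \times r$
matrices $[\bar\phi_1, \ldots, \bar\phi_q]$. Let $\bar B(x)$ denote $\bar B(x,x,\ldots,x)$, and $B(x,\xi)$ denote
$B(x,x,\ldots,x,\xi)$.

C3.5. The ODE (3.1) has a unique solution for each initial condition.

$$\begin{aligned}
\dot x_1 &= \bar\phi_{11}\bar b_1^1(x) + \cdots + \bar\phi_{q1}\bar b_1^q(x) \\
&\;\;\vdots \\
\dot x_r &= \bar\phi_{1r}\bar b_r^1(x) + \cdots + \bar\phi_{qr}\bar b_r^q(x)
\end{aligned}
\qquad \text{i.e.,} \quad \dot x = \hat\Phi \bar B(x). \tag{3.1}$$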


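C3.3'. There are a continuous $\bar B(\cdot)$ and $m_\epsilon \to \infty$ such that $\epsilon m_\epsilon = \delta_\epsilon \to 0$
and

$$\frac{1}{m_\epsilon} \sum_{k=n}^{n+m_\epsilon-1} E_n B(X, \xi_k) \to \bar B(X)$$

in probability for each X, uniformly in n.

C3.4'. There is a matrix $\bar\Phi$ such that, as $n - k \to \infty$,

$$E|E_k \Phi_n - \bar\Phi| \to 0.$$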

Let $n_\epsilon$ be a sequence tending to $\infty$ such that $\sqrt{\epsilon}\, n_\epsilon \to 0$ and

$$\sup_k P\{|\Phi(k+n_\epsilon|k) - \Phi_k| \ge \epsilon^2\} \le \epsilon^2.$$

There is such a sequence, by Lemma 2.1. In fact, we can use $n_\epsilon = O(\log 1/\epsilon)$.
Define

$$X_0^\epsilon = \Phi(n_\epsilon|0)X_0 + \epsilon \sum_{k=0}^{n_\epsilon-1} \Phi_{k+1} B(X_k, \xi_k)$$

and for $t \ge 0$ define $X^\epsilon(\cdot)$ by $X^\epsilon(t) = X_n$ for $t \in [(n-n_\epsilon)\epsilon, (n-n_\epsilon+1)\epsilon)$. Write
$X^\epsilon(\cdot) = (X^{\epsilon,1}(\cdot), \ldots, X^{\epsilon,q}(\cdot))$. It will turn out that, for any initial
conditions $X_0^i$, the vectors $X_n^i$, $i \le q$, rapidly come close together (due to the
communication and convexification). This leads to an (asymptotic in $\epsilon$)
jump in the process $X_{[t/\epsilon]}$ at t = 0. For this reason, we start $X^\epsilon(\cdot)$ slightly
away ($n_\epsilon$ steps) from the origin of the $\{X_n\}$ process.

Theorem 3.1. Assume (C2.1), the conditions on $\{A_n\}$ in Section 1, and
(C3.1), (C3.2), (C3.5) and either (C3.3), (C3.4) or (C3.3'), (C3.4'). Then $\{X^\epsilon(\cdot)\}$
is tight in $D[0,\infty)$ and converges weakly to a process $X(\cdot) = (x(\cdot), \ldots, x(\cdot))$,
where $x(\cdot)$ satisfies (3.1) with initial condition $x_0$, and $X(0) = \lim_\epsilon X_0^\epsilon = (x_0,
\ldots, x_0)$.
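
A small numerical experiment illustrates the statement of Theorem 3.1. The setup is hypothetical: scalar states, coin-flip communication that is symmetric across processors, and observations $b^i(x,\xi) = (\theta_i - x) + \xi$ with different $\theta_i$ per processor. By the symmetry of this particular model the averaged weights are taken to be 1/q, so the limit ODE (3.1) would read $\dot x = \text{mean}(\theta) - x$. All processors start from the same value, so that $x(0)$ is known.

```python
import math
import random


def simulate(thetas, x0, eps, horizon, comm_prob=0.3, alpha=0.25, seed=3):
    """Run (1.1) with scalar states and b^i(x, xi) = (theta_i - x) + xi up to time `horizon`."""
    rng = random.Random(seed)
    q = len(thetas)
    states = [x0] * q                      # identical starts, so x(0) = x0
    for _ in range(int(round(horizon / eps))):
        new_states = []
        for i in range(q):
            mixed = states[i]
            for j in range(q):
                if j != i and rng.random() < comm_prob:
                    # Convex combination: weight alpha moves from own state to state j.
                    mixed += alpha * (states[j] - states[i])
            noise = rng.gauss(0.0, 1.0)
            new_states.append(mixed + eps * (thetas[i] - states[i] + noise))
        states = new_states
    return states


def ode_solution(thetas, x0, horizon):
    """Closed-form solution of x' = mean(thetas) - x with x(0) = x0."""
    mean_theta = sum(thetas) / len(thetas)
    return mean_theta + (x0 - mean_theta) * math.exp(-horizon)
```

With `thetas = [0.0, 1.0, 2.0]`, `x0 = 4.0`, `eps = 0.001` and `horizon = 2.0`, every simulated state lies within a few hundredths of the ODE value, and the agreement improves as `eps` decreases. No single processor could find mean(θ) on its own here; the limit reflects the communication through the averaged weights.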

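Proof. Part 1. The proofs are essentially the same for the pairs (C3.3),
(C3.4) and (C3.3'), (C3.4'), and we work only with the first pair. We often
use Schwarz's inequality and the inequality (for $\alpha \ge 0$)
$E|\Phi(n|k) - \Phi_k|^{1+\alpha} \le \text{constant} \cdot E|\Phi(n|k) - \Phi_k|$, without specific mention.
Iterating (1.1) and letting $n \ge n_\epsilon$ yields

$$\begin{aligned}
X_{n+1} &= \Phi(n|0)X_0 + \epsilon \sum_{k=0}^{n_\epsilon-1} \Phi(n|k+1)B(X_k,\xi_k) + \epsilon \sum_{k=n_\epsilon}^{n} \Phi(n|k+1)B(X_k,\xi_k) \\
&= X_0^\epsilon + \epsilon \sum_{k=n_\epsilon}^{n} \Phi_{k+1}B(X_k,\xi_k) + \epsilon\rho_n^\epsilon + [\Phi(n|0) - \Phi(n_\epsilon|0)]X_0,
\end{aligned} \tag{3.2}$$

$$\rho_n^\epsilon = \sum_{k=0}^{n} [\Phi(n|k+1) - \Phi_{k+1}]B(X_k,\xi_k).$$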


For  the  purposes  of  the  weak  convergence  proof,  we  can  assume  (w.l.o.g.) 
that $\{X_k\}$ is bounded by simply truncating the dynamical terms; i.e., changing
$B(\cdot,\xi)$ so that it is zero for large $|X|$. If the theorem is true for each such
truncation,  then  by  the  uniqueness  assumption  (C3.5),  it  is  true  as  stated. 
Henceforth  we  assume  this  boundedness. 

Part 2. Next, we show that $\sup_{\epsilon,n} E|\rho_n^\epsilon|^3 < \infty$. All norms are as in
Section 2. We have

$$E|\rho_n^\epsilon|^3 \le \text{constant} \cdot \sum_{i,j,k \le n} E\,|\Phi(n|i+1) - \Phi_{i+1}|\,|\Phi(n|j+1) - \Phi_{j+1}|\,|\Phi(n|k+1) - \Phi_{k+1}| \cdot [1 + |\psi_i||\psi_j||\psi_k|].$$

By Hölder's inequality, the summand is bounded above by

$$E^{1/12}|\Phi(n|i+1) - \Phi_{i+1}|^{12} \cdot E^{1/12}|\Phi(n|j+1) - \Phi_{j+1}|^{12} \cdot E^{1/12}|\Phi(n|k+1) - \Phi_{k+1}|^{12} \cdot [1 + E^{1/4}|\psi_i|^4 \cdot E^{1/4}|\psi_j|^4 \cdot E^{1/4}|\psi_k|^4].$$

By (C3.2), the geometric convergence in Lemma 2.1 and the boundedness
of $\Phi(n|i)$ and $\Phi_i$, there is a $d \in [0,1)$ such that this term is bounded above by
(constant) $d^{n-i} d^{n-j} d^{n-k}$. Thus $\sup_{\epsilon,n} E|\rho_n^\epsilon|^3 < \infty$. From this and (3.2) (and the
truncation of $B(\cdot,\cdot)$)

$$\sup_{\epsilon,\, n \ge n_\epsilon} E|X_{n+1} - X_n|^2/\epsilon^2 < \infty,$$

and $\{|X_{n+1} - X_n|/\epsilon,\ n \ge n_\epsilon,\ \epsilon\}$ is uniformly integrable. Thus, $\{X^\epsilon(\cdot)\}$ is tight
in $D[0,\infty)$ and all limit paths are Lipschitz continuous (in t).

Part 3. We fix and work with a weakly convergent subsequence of
$\{X^\epsilon(\cdot)\}$, also indexed by $\epsilon$, and with limit denoted by $X(\cdot)$. Skorokhod
imbedding (see Appendix) will be used where useful, without specific
mention. Thus, we can assume, where needed, that $X^\epsilon(\cdot) \to X(\cdot)$ uniformly
on bounded time intervals, w.p.1.

We will show, for each real valued function $f(\cdot)$ with compact support
and continuous second derivatives, that the $M_f(\cdot)$ defined by

$$M_f(t) = f(X(t)) - f(X(0)) - \int_0^t f_X'(X(s))\bar\Phi \bar B(X(s))\,ds \tag{3.3}$$

is a (continuous) martingale. Since $M_f(\cdot)$ is a Lipschitz continuous
martingale (since $X(\cdot)$ is Lipschitz continuous), it is a constant. Thus, since
$M_f(0) = 0$, we have $M_f(t) = 0$ or, equivalently, $\dot X = \bar\Phi \bar B(X)$. By the properties
of $\Phi_k$, for each $i \le r$, the rows $i, i+r, \ldots, i+qr-r$ of $\bar\Phi$ are equal. Thus all
r-vector components of the limit $X(\cdot)$ must be equal, i.e., $X(\cdot)$ is of the form
$(x(\cdot), \ldots, x(\cdot))$, for $x(t) \in E^r$. This and $\dot X = \bar\Phi \bar B(X)$ imply that $x(\cdot)$ satisfies
(3.1).

We need only show the martingale property. To do this, we need only
show that for any integer p, continuous bounded $h(\cdot)$, $t_i \le t$, $i \le p$, and
$s > 0$,

$$E\,h(X(t_i), i \le p)\Big[f(X(t+s)) - f(X(t)) - \int_t^{t+s} f_X'(X(u))\bar\Phi \bar B(X(u))\,du\Big] = 0. \tag{3.4}$$


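To simplify the notation (and w.l.o.g.), let t and s be integral multiples
of $\epsilon m_\epsilon = \delta_\epsilon$ (see (C3.4) for the definition of $m_\epsilon$) and define the index set
$I_l^\epsilon = \{n: l m_\epsilon + n_\epsilon \le n < l m_\epsilon + m_\epsilon + n_\epsilon\}$. By Taylor's Theorem and (3.2),

$$\begin{aligned}
f(X^\epsilon(t+s)) - f(X^\epsilon(t)) &= \sum_{l:\, t \le l\delta_\epsilon < t+s} [f(X_{lm_\epsilon+m_\epsilon+n_\epsilon}) - f(X_{lm_\epsilon+n_\epsilon})] \\
&= \epsilon \sum_{l:\, t \le l\delta_\epsilon < t+s} f_X'(X_{lm_\epsilon+n_\epsilon}) \sum_{k \in I_l^\epsilon} \Phi_{k+1}B(X_k,\xi_k) + \text{error terms},
\end{aligned} \tag{3.5}$$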


where  the  error  term  is  of  the  order  of  the  sum  of  (all  norms  are  in  the  ic 


sense) 


£  $l*l»6+n6l-  **  J  l^m£+n€l2-  £> 

I  l<t>(im£  +  m£  +  n£|0)  -  *(n £|0)||XJ, 


E  e2(l  +  UJ2), 

k  * 


where the sums are over all l such that $t \le l\delta_\epsilon < t + s$ and k is summed
over $t \le \epsilon k - \epsilon n_\epsilon < t + s$. The mean values of the error terms go to zero as $\epsilon \to 0$.


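By (3.5),

$$\lim_\epsilon E\,h(X^\epsilon(t_i), i \le p)[f(X^\epsilon(t+s)) - f(X^\epsilon(t))] = \lim_\epsilon E\,h(X^\epsilon(t_i), i \le p)\Big[\epsilon \sum_{l:\, t \le l\delta_\epsilon < t+s} f_X'(X_{lm_\epsilon+n_\epsilon}) \sum_{k \in I_l^\epsilon} \Phi_{k+1}B(X_k,\xi_k)\Big]. \tag{3.6}$$

We now rearrange the terms in a more convenient way. Define

$$b_l^\epsilon = \frac{1}{m_\epsilon} \sum_{k \in I_l^\epsilon} E_{lm_\epsilon+n_\epsilon} \Phi_{k+1}B(X_k,\xi_k)$$

and define the function $B^\epsilon(\cdot)$ by

$$B^\epsilon(t) = f_X'(X_{lm_\epsilon+n_\epsilon})\, b_l^\epsilon \quad \text{for } l\delta_\epsilon \le t < l\delta_\epsilon + \delta_\epsilon.$$

Since $X^\epsilon(t_i)$, $i \le p$, is measurable on the $\sigma$-algebra $F_{lm_\epsilon+n_\epsilon}$ for $l\delta_\epsilon \ge t$, (3.6)
can be rewritten as

$$\lim_\epsilon E\,h(X^\epsilon(t_i), i \le p)\, \delta_\epsilon \sum_{l:\, t \le l\delta_\epsilon < t+s} f_X'(X_{lm_\epsilon+n_\epsilon})\, b_l^\epsilon = \lim_\epsilon E\,h(X^\epsilon(t_i), i \le p) \int_t^{t+s} B^\epsilon(u)\,du. \tag{3.7}$$

If

$$B^\epsilon(u) \to f_X'(X(u))\bar\Phi \bar B(X(u)) \tag{3.8}$$

in probability for almost all u, then the second limit in (3.7) would be

$$E\,h(X(t_i), i \le p) \int_t^{t+s} f_X'(X(u))\bar\Phi \bar B(X(u))\,du.$$

Using this and taking limits in (3.6) yields the desired result (3.4), and we will
be done. Thus, we need only show (3.8).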

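Fix u and for $\epsilon > 0$, define $l_\epsilon$ by $u \in [l_\epsilon\delta_\epsilon, l_\epsilon\delta_\epsilon + \delta_\epsilon)$. Then we need
to show that

$$\frac{1}{m_\epsilon} \sum_{k \in I_{l_\epsilon}^\epsilon} f_X'(X_{l_\epsilon m_\epsilon+n_\epsilon})\, E_{l_\epsilon m_\epsilon+n_\epsilon} \Phi_{k+1}B(X_k,\xi_k) \to f_X'(X(u))\bar\Phi \bar B(X(u)). \tag{3.9}$$

By (C3.2) (and the truncation), we can replace the $X_k$ in (3.9) by $X_{l_\epsilon m_\epsilon+n_\epsilon}$
without changing the limit. Using this and the independence assumption
(C3.1), we can rewrite (3.9) as

$$\frac{1}{m_\epsilon} \sum_{k \in I_{l_\epsilon}^\epsilon} f_X'(X_{l_\epsilon m_\epsilon+n_\epsilon})\, [E_{l_\epsilon m_\epsilon+n_\epsilon} \Phi_{k+1}]\, [E_{l_\epsilon m_\epsilon+n_\epsilon} B(X_{l_\epsilon m_\epsilon+n_\epsilon}, \xi_k)] + \text{error term}, \tag{3.10}$$

where the error term goes to zero in the mean. By the convergence of $X^\epsilon(\cdot)$
to $X(\cdot)$ and using $l_\epsilon\delta_\epsilon \to u$, and (C3.3), (C3.4), we get that (3.10) converges in
the mean to the right side of (3.9) as $\epsilon \to 0$ and the proof is concluded. (The
‘intermediate’ details in the last part of the proof are very similar to those in
the ‘centralized’ case. See [8, Chapter 5.2] or [9].) Q.E.D.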

State Dependent $\{A_n\}$ and $\{\xi_n\}$ and/or Discontinuous Dynamics. The state
dependent ‘communication’ and noise is most conveniently modeled by a
‘Markov’ dependence. This will allow $\{A_n\}$ and $\{\xi_n\}$ to depend on the state in
a variety of ways: $A_n$ can depend (statistically) on recent events or on
changes in the $X_n$-sequence greater than a given magnitude over some time
interval, or on time elapsed since recent communications or on the ‘levels’ of
recent communications (i.e., the degree of ‘convexification’ or incorporation
of received data into one's own estimate can depend on the nature of or
timing of recent receptions, transmissions, etc.). To be precise, we suppose
that there is a bounded sequence of random variables $\{\tilde A_n\}$ such that $A_n$ is a
component of $\tilde A_n$ and, for each $\epsilon > 0$, $\{X_n, \tilde A_{n-1}, \xi_{n-1}\}$ is a Markov process
with a homogeneous transition function. The $\tilde A_n$ can incorporate other data;
e.g., time elapsed since last reception, transmission, etc. The case where some
components of $B(\cdot,\cdot)$ are merely indicator functions (hence, not continuous
functions) is of particular importance in applications. Such
‘Markovianizations’ seem to be quite natural for many problems. It might be
hard to explicitly evaluate the ODE's here, but the character of the results is
clear and precisely what is wanted.
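
As an illustration of the kind of extra data that $\tilde A_n$ can carry, the sketch below implements one hypothetical state dependent communication rule for a scalar state: transmit when a fixed number of steps has passed since the last transmission, or when the state has moved by more than a threshold since then. The class name, its parameters and the scalar state are assumptions of this example, not part of the model above.

```python
class CommunicationTrigger:
    """Decide when a processor transmits: on a timer or after a large state change."""

    def __init__(self, max_wait, threshold, initial_state):
        self.max_wait = max_wait          # steps allowed between transmissions
        self.threshold = threshold        # state change that forces a transmission
        self.last_sent = initial_state    # state value at the last transmission
        self.steps_since_send = 0         # elapsed steps since the last transmission

    def update(self, state):
        """Advance one step; return True if the processor should transmit now."""
        self.steps_since_send += 1
        timed_out = self.steps_since_send >= self.max_wait
        moved = abs(state - self.last_sent) > self.threshold
        if timed_out or moved:
            self.last_sent = state
            self.steps_since_send = 0
            return True
        return False
```

The pair (`steps_since_send`, `last_sent`) together with the current state evolves as a Markov process, which is the structure assumed in this section. Because `max_wait` bounds the time between transmissions, a rule of this form also keeps the inter-communication intervals bounded.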

Example. For one example of the appearance of state dependent noise,
see the ‘routing’ problem in [5]. In that example, inputs to a service or
communication system occur at random, and the service times are random
(correlated or not). The parameter x (the state) determines the probability
that incoming events are routed along particular channels. The effective
noise is a consequence of the queue length or occupancy level of each
channel; its statistics are dependent on the routing parameter. A Markov
dependence model was appropriate there. The routing parameter at time n
increased or decreased by $\epsilon$, depending on whether or not certain events
occurred at time n; hence the dynamics were discontinuous. The model used
in this section includes ‘decentralized’ generalizations of such problems.

Assume that the marginal one-step transition function is of the product
(conditionally independent) form, for some $P_C$ and $P_N$,

$$P\{\tilde A_1 \in B_1, \xi_1 \in B_2 \mid X_1, \tilde A_0, \xi_0\} = P_C\{\tilde A_1 \in B_1 \mid X_1, \tilde A_0\}\, P_N\{\xi_1 \in B_2 \mid X_1, \xi_0\} \tag{3.11}$$

(C denotes ‘communication’, N denotes ‘noise’). The $P_C$ and $P_N$ will not
depend on $\epsilon$. We can allow some $\epsilon$-dependence, but, in many applications,
$\epsilon$ is merely a step size parameter and does not affect the distribution of the
$\tilde A$, $A$ or $\xi$ other than via the values of the X (e.g., as in the above
example). The product form (3.11) is a natural generalization of (C3.1).
Here the noise and intercommunication intervals are independent, conditional
on the state. For each fixed X, the $P_C$ and $P_N$ in (3.11) can be considered to
be one-step transition functions for ‘fixed X’ Markov chains which we
denote by $\{\tilde A_n(X)\}$, $\{\xi_n(X)\}$. Let $P_C(\tilde A, n, \cdot \mid X)$ and $P_N(\xi, n, \cdot \mid X)$ denote the
associated n-step transition functions. Then $P_C(\tilde A, 1, \cdot \mid X) = P_C\{\tilde A_1 \in \cdot \mid \tilde A_0 =
\tilde A, X\}$, etc. Let $E_C^X$ and $E_N^X$ denote the associated expectations.

Several assumptions will now be given, followed by some remarks
concerning extensions. The assumptions are phrased so as to cover many
potential applications.

(C3.6) $E_C^X[A_n \cdots A_1 \mid \tilde A_0 = \tilde A] = F_n(\tilde A, X)$ is continuous in $(\tilde A, X)$.

(C3.7) For each pair of bounded and continuous functions $f_i(\cdot)$, i = 1, 2,
$\int f_1(\xi_1) P_N(\xi, 1, d\xi_1 \mid X)$ and $\int f_2(\tilde A_1) P_C(\tilde A, 1, d\tilde A_1 \mid X)$ are continuous in $(\xi, X)$ and
$(\tilde A, X)$ respectively.

(C3.8) $\{\xi_n\}$ is bounded.

(C3.9) For each X of the form X = (x, x, ..., x), let the pair of processes
$\{\tilde A_n(X), \xi_n(X)\}$ associated with the n-step transition function
$P_C\{\tilde A, n, \cdot \mid X\} P_N\{\xi, n, \cdot \mid X\}$ have a unique invariant measure, which is of the
product form $P_C^x\{\cdot\} P_N^x\{\cdot\}$.

(C3.10) $\int B(X, \xi_1) P_N\{\xi, 1, d\xi_1 \mid X\}$ is continuous in $(X, \xi)$.

Remark. Since the two fixed $X$-processes are independent, the product
form in (C3.9) will hold if the processes are aperiodic. Under the conditions
of Lemma 2.1, the $F_n(A,X)$ in (C3.6) converge geometrically (uniformly in
$A$, $X$) to a function $\Phi(A,X)$, which must be continuous under (C3.6). By the
discussion associated with Theorem 3.1, we see that $\Phi(A,X)$ has the form

$$\Phi(A,X) = \begin{bmatrix} \phi_1(A,X) & \cdots & \phi_q(A,X) \\ \vdots & & \vdots \\ \phi_1(A,X) & \cdots & \phi_q(A,X) \end{bmatrix} = \begin{bmatrix} \bar\Phi(A,X) \\ \vdots \\ \bar\Phi(A,X) \end{bmatrix},$$

where $\phi_i(A,X)$ is a diagonal ($r \times r$) matrix. Write $\phi_i(A,X) = \mathrm{diag}[\phi_{i1}(A,X), \ldots, \phi_{ir}(A,X)]$.
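The geometric convergence of the matrix products to a limit with identical rows can be seen numerically. The following is a minimal sketch for $q = 2$, $r = 1$ with state independent, i.i.d. mixing matrices. The particular matrices (weight `c` on a received value), the probability `p` and the function names are illustrative assumptions, not taken from the text.

```python
import random

def mixing_matrix(two_to_one, one_to_two, c):
    """2x2 row-stochastic matrix; weight c on a received value (assumed form)."""
    row1 = [1 - c, c] if two_to_one else [1.0, 0.0]
    row2 = [c, 1 - c] if one_to_two else [0.0, 1.0]
    return [row1, row2]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def product_row_gap(n, p=0.3, c=0.25, seed=0):
    """Form A_n ... A_1 and record the largest gap between its two rows."""
    rng = random.Random(seed)
    prod = [[1.0, 0.0], [0.0, 1.0]]
    gaps = []
    for _ in range(n):
        a = mixing_matrix(rng.random() < p, rng.random() < p, c)
        prod = matmul(a, prod)  # newest factor multiplies on the left
        gaps.append(max(abs(prod[0][j] - prod[1][j]) for j in range(2)))
    return prod, gaps

prod, gaps = product_row_gap(400)
```

Each communication shrinks the gap between the rows by a factor $1-c$ or $1-2c$ in this assumed family, so the two rows of the product agree to high accuracy after a few hundred steps, while each row still sums to one.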




If $X$ takes the form $X = (x,x,\ldots,x)$ for $x \in E^r$, we might simply write $x$
for $X$.

(C3.11) The ODE (3.12) has a unique solution for each initial condition
(analogous to (3.1); the $A$ and $\xi$ are simply averaged out with respect to the
invariant measure):

(3.12)  $\dot x_j = \sum_i \int \phi_{ij}(A,x)P^x_C\{dA\} \int b^i_j(x,\xi)P^x_N(d\xi), \quad j = 1, \ldots, r.$

Write (3.12) as $\dot x = \bar\Phi(x)\bar B(x)$, where

$\bar\Phi(x) = \int \bar\Phi(A,x)P^x_C(dA), \qquad \bar B(x) = \int B(x,\xi)P^x_N(d\xi).$

Under  (C3.7),  (C3.9)  and  (C3.10),  the  right  side  of  (3.12)  is  continuous. 

Remarks on the Assumptions. In many applications, $A$ takes only a
finite number of values. Then the appropriate topology is the discrete
topology and the $A$-continuity required in (C3.6) and (C3.7) always holds,
since then all functions of $A$ are continuous. The 1-step smoothing
assumption in (C3.10) can be replaced by a $k$-step smoothing assumption,
and Theorem 3.2 will still hold. Since $\Phi(A,X)$ is continuous (see the above
remark), a $\Phi$-analog of (C3.10) is not needed. In (3.12) we are simply
averaging the dynamics with respect to the invariant measures. If the
invariant measure is not unique, then the right side of (3.12) is set valued
and $P^x_C$ and $P^x_N$ range over all the invariant measures. We use (C3.8) here to
avoid some details. Extensions to cover typical unbounded $\{\xi_n\}$ cases are
possible via essentially the same method. To see how this might be done, see
the proof for the 'centralized' case in [8, Chapter 5.3] or in [9].

Theorem 3.2. Assume (C2.1), the conditions on $\{A_n\}$ in Section 1, (C3.2)
(without  the  $  component)  and  (C3.6)  to  (C3.10).  Then  (X€(  )}  is  tight  in 
$D[0,\infty)$ and converges weakly to $X(\cdot) = (x(\cdot), \ldots, x(\cdot))$, where $x(\cdot)$ satisfies
(3.12) and $X(0) = (x(0), \ldots, x(0)) = \Phi_0X_0$.

Proof. $\{X^\epsilon(\cdot)\}$ is tight and all limits are Lipschitz continuous for
the same reasons as in Theorem 3.1. Let $\epsilon$ index a weakly convergent
subsequence with limit denoted by $X(\cdot)$. As in Theorem 3.1, $X(\cdot)$ has
the form $X(\cdot) = (x(\cdot), \ldots, x(\cdot))$. Owing to the Markov assumption, $E_k$
denotes conditioning on $(X_k, \xi_{k-1}, A_{k-1})$. By the method of proof of
Theorem 3.1, we need only show that the left side of (3.9) converges in
probability to $f_x(X(u))\Phi(X(u))\bar B(X(u))$ for $X(u)$ of the form $X(u) = (x(u),
\ldots, x(u))$. The $f_x$ term does not play an important role and we discard it
henceforth.

We use the 'truncation' method and notation discussed in Theorem 3.1.
Thus, we can suppose that $B(\cdot,\cdot)$ and $\{X_n\}$ are bounded. For each $v$, rewrite
(3.9) as (using the conditional independence implied by (3.11))

(3.13)  $\sum_k E_k\Phi_{k+1}\,E_kB(X_k,\xi_k) = \sum_k E_k(A_{k+v}\cdots A_{k+1})\,E_kB(X_k,\xi_k) + Q^\epsilon_v,$

where the summation range, the normalization and the outer conditional
expectation are those of (3.9), and where $E|Q^\epsilon_v| \to 0$ uniformly in $\epsilon$, as
$v \to \infty$, by Lemma 2.1.

We next estimate $E_{k+1}A_{k+v}\cdots A_{k+1}$. All norms are in the $l_\infty$ sense.
Since the $\{X_n\}$ and $\{A_n\}$ lie in a compact set, the function of $\delta X$ defined
by

$\delta_j(|\delta X|) = \sup_{A,X}|F_j(A,X) - F_j(A,X + \delta X)|$

can be supposed to go to zero as $|\delta X| \to 0$. We have $E_nA_n = F_1(A_{n-1},X_n) =
F_1(A_{n-1},X_{n-1}) + \Delta_1(A_{n-1},X_{n-1},X_n)$, where $|\Delta_1(A_{n-1},X_{n-1},X_n)| \le \delta_1(|X_n - X_{n-1}|)$.
Next we can write $E_{n-1}A_nA_{n-1} = E_{n-1}(E_nA_n)A_{n-1} = E_{n-1}F_1(A_{n-1},X_{n-1})A_{n-1} +
\Delta_1(A_{n-1},X_{n-1},X_n)A_{n-1}$. Note that

(3.14)  $E_{n-1}F_1(A_{n-1},X_{n-1})A_{n-1} = F_2(A_{n-2},X_{n-1}) = E_C^{X_{n-1}}A_nA_{n-1},$

which is just the expectation for the 2-step fixed $X$-process with $X$ fixed at
$X_{n-1}$. Using this and $|A_n| = 1$, we have

(3.15)  $E_{n-1}A_nA_{n-1} = F_2(A_{n-2},X_{n-2}) + \text{error terms},$

where

$|\text{error terms}| = |(F_2(A_{n-2},X_{n-1}) - F_2(A_{n-2},X_{n-2})) + \Delta_1(A_{n-1},X_{n-1},X_n)A_{n-1}|
\le \delta_2(|X_{n-1} - X_{n-2}|) + \delta_1(|X_n - X_{n-1}|).$
Continuing in this way, we get

(3.16)  $E_{k+1}A_{k+v}\cdots A_{k+1} = F_v(A_k,X_k) + T^{k,v},$

where

$|T^{k,v}| \le \sum_{i=1}^{v}\delta_i(|X_{k+v+1-i} - X_{k+v-i}|)$

and $E|T^{k,v}| \to 0$ for each $v$, uniformly in $k$, owing to the convergence of the
$X^\epsilon(\cdot)$.

Putting the estimate (3.16) into (3.13) yields that the left side of (3.13) equals

(3.17)  $\sum_k F_v(A_k,X_k)B(X_k,\xi_k) + \sum_k T^{k,v}B(X_k,\xi_k) + Q^\epsilon_v,$

with the same summation range, normalization and outer conditional
expectation as in (3.13). The last two right hand terms in (3.17) go to zero
in mean as $\epsilon \to 0$ and then $v \to \infty$, and can be neglected. The sequence $F_v(A,X)$
converges uniformly to the continuous function $\Phi(A,X)$ as $v \to \infty$. Thus, the
limit (as $\epsilon \to 0$, $v \to \infty$) of the first term on the right side of (3.17) is the
same if $\Phi(A,X)$ replaces $F_v(A,X)$.

Now, we are in a position to use the result of [8, Chapter 5.3] or [9]. By
the arguments (for the Markov model) in either of these references (which,
when adapted to our current situation, require the continuity of $\Phi(\cdot,\cdot)$, and
(C3.7), (C3.8), (C3.10)) and the fact that $X^\epsilon(\cdot) \Rightarrow X(\cdot)$, the first term on the
right side of (3.17), with $\Phi(A,X)$ replacing $F_v(A,X)$, converges to

(3.18)  $\int \Phi(A,X(u))B(X(u),\xi)\,m_{X(u)}(dA\,d\xi),$

where $m_X(\cdot)$ is an invariant measure for the process $\{A_n(X),\xi_n(X)\}$. Since
$X(u) = (x(u), \ldots, x(u))$, the uniqueness and product form of the invariant
measure in (C3.9) yield that $m_x(dA\,d\xi) = P^x_C(dA)\cdot P^x_N(d\xi)$. Thus the right side
of (3.18) equals

$\Phi(x(u))\bar B(x(u)) = \int \Phi(A,x(u))P^{x(u)}_C(dA)\cdot\int B(x(u),\xi)P^{x(u)}_N(d\xi),$

and the proof is concluded. Q.E.D.
4. State Space Constraints: A Projection Algorithm

In many applications, it is desired to confine the iterates to a
compact set $L$, and if they ever leave $L$, the algorithm will project them
back onto $L$. Such algorithms are ubiquitous in applications, even if not
explicitly defined or assumed; e.g., the ambiguous notion of 'monitoring'
in adaptive control, which implicitly assumes some sort of projection. We
treat two special, but useful, cases.

Assumptions  and  Problem  Formulation 

(C4.1) Let $g_i(x)$, $i \le a$, be real valued continuously differentiable
functions on $E^r$ and define $L = \{x : g_i(x) \le 0, i \le a\}$. Let $L$ be bounded,
convex, and the closure of its interior. Also (w.l.o.g.) assume that the
gradient $g_{i,x}(x)$ is not zero if $g_i(x) = 0$.

Let $\pi_L(y)$ denote the (unique) closest point on $L$ to $y \in E^r$. We use
the projected form of algorithm (1.1):

(4.1)  $\tilde X_{n+1} = A_nX_n + \epsilon B(X_n,\xi_n), \qquad X^i_{n+1} = \pi_L(\tilde X^i_{n+1}), \quad i \le q.$

Thus, each processor projects independently and the constraint set is the
same for each. We now set the problem up so that previous results can be
used.

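Define $\rho_n = (\rho^1_n, \ldots, \rho^q_n)$, where $\rho^i_n = [X^i_{n+1} - \tilde X^i_{n+1}]/\epsilon$, and define
$\tilde\phi^\epsilon_n = \phi^\epsilon_n + \sum_{k=0}^{n}[\Phi(n|k+1) - \Phi_{k+1}]\rho_k$. Then for $n \ge n_\epsilon$ ($n_\epsilon$ was defined below (C3.4'))

(4.2)  $X_{n+1} = \tilde X_0 + \epsilon\sum_{k=n_\epsilon}^{n}\Phi_{k+1}B(X_k,\xi_k) + \epsilon\sum_{k=n_\epsilon}^{n}\Phi_{k+1}\rho_k + \epsilon\tilde\phi^\epsilon_n + [\Phi(n|0) - \Phi(n_\epsilon|0)]X_0,$

where

$\tilde X_0 = \Phi(n_\epsilon|0)X_0 + \epsilon\sum_{k=0}^{n_\epsilon-1}\Phi_{k+1}[B(X_k,\xi_k) + \rho_k].$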

The two cases which we treat are covered by the two following assumptions.

(C4.2) The matrices $a_{ij}(n)$ in $A_n$ take the form $a_{ij}(n) = \alpha_{ij}(n)I_r$, where $\alpha_{ij}(n)$ is
a scalar valued random variable and $\sum_j\alpha_{ij}(n) = 1$.

Under (C4.2) each of the scalar components 'communicated' from a
processor $j$ to processor $i$ is incorporated in the same way into the updated
estimates of processor $i$.

(C4.2') There are bounded $g_{1i}$ and $g_{2i}$ such that $L = \{x : g_{1i} \le x_i \le g_{2i}, i \le r\}$.
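One step of the projected recursion (4.1) under the box constraint (C4.2') can be sketched as follows, for $q = 2$ processors and $r = 1$. The drift $b(x,\xi) = -(x-2) + \xi$, the box $[-1,1]$, the mixing weights and all function names are hypothetical choices made only for illustration; the unconstrained stable point 2 lies outside the box, so the iterates should settle at the upper bound.

```python
import random

def project_box(x, lo, hi):
    """Closest point of the interval [lo, hi] to x (pi_L for a box)."""
    return min(max(x, lo), hi)

def run_projected(n_steps, eps=0.01, c=0.25, p=0.3, lo=-1.0, hi=1.0, seed=1):
    rng = random.Random(seed)
    xs = [-0.5, 0.5]
    path = [list(xs)]
    for _ in range(n_steps):
        # Assumed mixing matrix A_n: a received value gets weight c.
        row1 = [1 - c, c] if rng.random() < p else [1.0, 0.0]
        row2 = [c, 1 - c] if rng.random() < p else [0.0, 1.0]
        mixed = [row1[0] * xs[0] + row1[1] * xs[1],
                 row2[0] * xs[0] + row2[1] * xs[1]]
        # Unconstrained update, then each processor projects on its own.
        tilde = [mixed[i] + eps * (-(xs[i] - 2.0) + rng.uniform(-1.0, 1.0))
                 for i in range(2)]
        xs = [project_box(t, lo, hi) for t in tilde]
        path.append(list(xs))
    return path

path = run_projected(2000)
```

Since the mixing step forms convex combinations of points of the box, the mixed values stay in the box, and only the $\epsilon B$ increment can push an iterate outside.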

Definitions. For a vector field $h(\cdot)$ in $E^r$, define the projection onto $L$
by (for $x \in L$) $\pi(x,h(x)) = \lim_{\Delta \to 0}[\pi_L(x + \Delta h(x)) - x]/\Delta$. By the convexity of $L$,
the limit is unique. Define the convex cone

$C(x) = \{y : y = \sum_{i \in A(x)}\lambda_i g_{i,x}(x), \ \lambda_i \ge 0\},$

where $A(x)$ is the set of constraints $\{i : g_i(x) = 0\}$ (the active constraints
at $x$). Note that $\rho^i_n \in -C(X^i_{n+1})$. Write $A_n\rho_n = (Z^1_n, \ldots, Z^q_n)$, where $Z^i_n \in E^r$.

Under (C4.2), each $Z^i_n$ is a convex combination of vectors in the
$-C(X^j_{n+1})$, $j \le q$. We will see below that the same property holds under
(C4.2').


The theorem is stated under the conditions of Theorem 3.1, but there is
an analogous result under the conditions of Theorem 3.2.

Theorem 4.1. Assume the conditions of Theorem 3.1, (C4.1) and either
(C4.2) or (C4.2'). Let the solution to (4.3) (the projected form of (3.1)) be
unique. Then $\{X^\epsilon(\cdot)\}$ converges weakly to $X(\cdot)$, where $X(\cdot) = (x(\cdot), \ldots, x(\cdot))$ and

(4.3)  $\dot x = \pi(x, \bar\Phi\bar B(x)).$

Equivalently,

(4.4)  $\dot x = \bar\Phi\bar B(x) + v(x),$

where $v(x(t)) \in -C(x(t))$ (for almost all $t$). Also $X(0) = \Phi_0X_0 = (x(0), \ldots, x(0))$,
if $X^i_0 \in L$.

Proof. Only (4.4) will be proved, since (4.4) implies (4.3). No
truncation (see Theorem 3.1) is needed here since $X^i_n \in L$, a compact set.
Define the process $R^\epsilon(\cdot)$ by

$R^\epsilon(t) = \epsilon\sum_{k=n_\epsilon}^{n-1}\Phi_{k+1}\rho_k$ for $t \in [(n-n_\epsilon)\epsilon, (n-n_\epsilon+1)\epsilon)$

(analogous to the definition of $X^\epsilon(\cdot)$ above Theorem 3.1). All norms below
are in the $l_\infty$ sense. For $X^i_n \in L$, the $q$ $r$-vector components of $A_nX_n$ are all
in $L$ under either (C4.2) or (C4.2'). Thus $|\rho^i_n| \le |B(X_n,\xi_n)|$. Hence, the proof
of uniform integrability of $\{\tilde\phi^\epsilon_n\}$ and $\{\rho_n\}$ is the same as that for $\{\phi^\epsilon_n\}$ given
in Theorem 3.1. Thus $\{X^\epsilon(\cdot), R^\epsilon(\cdot)\}$ is tight and all weak limits are
Lipschitz continuous. Henceforth, we fix and work with a weakly
convergent subsequence, also indexed by $\epsilon$, and with limit $(X(\cdot), R(\cdot))$.




As in Theorem 3.1, for $i \le r$, the $i, i+r, \ldots, i+rq-r$ rows of $\Phi_k$ are equal.
Then so are the same components of $\Phi_{k+1}B(X_k,\xi_k)$ and of $\Phi_{k+1}\rho_k$. Thus (as in
Theorem 3.1) $X(\cdot) = (x(\cdot), \ldots, x(\cdot))$ and $R(\cdot) = (\bar R(\cdot), \ldots, \bar R(\cdot))$, where $x(t)$ and
$\bar R(t)$ are in $E^r$, and

(4.5)  $\dot x = \bar\Phi\bar B(x) + \dot{\bar R}(t).$

Obviously $x(t) \in L$. Thus, we need only show that $\dot{\bar R}(t) \in -C(x(t))$ for almost
all $t$.

Write $X^\epsilon(\cdot) = (X^{\epsilon,1}(\cdot), \ldots, X^{\epsilon,q}(\cdot))$. Let $x(t)$ be in the interior of $L$ for
$t \in [t_1,t_2]$ with $t_1 < t_2$. Then, by the weak convergence (i.e., convergence of all
$X^{\epsilon,i}(\cdot)$ to $x(\cdot)$) the $X^{\epsilon,i}(t)$, $i \le q$, are strictly interior to $L$ on $[t_1,t_2]$ with a
probability which tends to unity as $\epsilon \to 0$. Thus, for small $\epsilon$, the cones
$C(X^{\epsilon,i}(t))$, $i \le q$, $t \in [t_1,t_2]$, will be empty with a probability which tends to
unity as $\epsilon \to 0$. Thus $\dot{\bar R}(t) = 0$ for $t_1 \le t \le t_2$.

We need now only consider the case where $x(t)$ is on the boundary of
$L$ for $t \in [t_1,t_2]$, $t_1 < t_2$. Skorohod imbedding will be used (see
Appendix), so that we can assume that the convergence is with
probability one on each bounded time interval. Note that $C(x)$ is an
upper semicontinuous function of $x$ in the sense that if $x_n \to x$, then

(4.6)  $C(x) \supset \bigcap_n \bigcup_{k \ge n} C(x_k).$

Let $(g_{i_1,x}(x(t)), \ldots, g_{i_a,x}(x(t))) = (v_1, \ldots, v_a)$ be the gradient vectors of the
active constraints at $x = x(t)$, and let $C_\delta$ denote the convex cone formed
by the vectors in a $\delta$-neighborhood of $(v_1, \ldots, v_a)$.


By the weak convergence (i.e., the convergence of all $X^{\epsilon,i}(\cdot)$ to $x(\cdot)$) and
(4.6), for each $\delta > 0$ and $\gamma > 0$, there are $\delta_1 > 0$ and $\epsilon_1 > 0$ such that for $\epsilon \le \epsilon_1$

(4.7)  $P\{\rho^i_k \in -C_\delta,\ i \le q,\ \text{all } k \text{ such that } |\epsilon(k - n_\epsilon) - t| \le \delta_1\} \ge 1 - \gamma;$

i.e., for $\epsilon(k-n_\epsilon)$ close enough to $t$, the $\rho^i_k$ are in a 'small neighborhood' of
$-C(x(t))$ with a probability close to unity.

Now, assume (C4.2). Then, each of the $q$ $r$-vector components of
$\Phi_{k+1}\rho_k$ is also in such a 'small neighborhood' with a probability close to
unity, for $\epsilon(k-n_\epsilon)$ close to $t$. This implies that $\dot{\bar R}(t) \in -C(x(t))$, for almost all
$t$.

Write $x(t) = (x_1(t), \ldots, x_r(t))$. Assume (C4.2'), and let $\epsilon(k-n_\epsilon)$ be close to
$t$. Then $C(x(t))$ is particularly simple. Write $\rho^j_k = (\rho^{j,1}_k, \ldots, \rho^{j,r}_k)$, where the $\rho^{j,i}_k$
are scalar valued. If $x_i(t) = g_{1i}$ (the lower limit) then (using the weak
convergence $X^\epsilon(\cdot) \Rightarrow (x(\cdot), \ldots, x(\cdot))$) the $\rho^{j,i}_k$ must be (asymptotically in $\epsilon$) $\ge 0$
for all $j$, with a probability arbitrarily close to unity. Similarly, if $x_i(t) = g_{2i}$
(the upper limit), then (asymptotically in $\epsilon$) the $\rho^{j,i}_k$ must be $\le 0$ for all $j$. By
the properties of $\Phi_{k+1}$, the same property must hold for the respective
components $(i, i+r, i+2r, \ldots)$ of $\Phi_{k+1}\rho_k$.

The conclusion follows from this last remark, since if $x = (x_1, \ldots, x_r)$,
where $x_i = g_{1i}$, $i \le r_1$, $x_i = g_{2i}$, $r_1 < i \le r_2$ and $g_{1i} < x_i < g_{2i}$, $r_2 < i \le r$, then we
have that $-C(x)$ is the collection of vectors whose first $r_1$ components are
nonnegative, the next $r_2 - r_1$ are nonpositive and the last $r - r_2$ are zero.


5. The Asymptotics of $X^\epsilon(\cdot)$ for Large $t$ and Small $\epsilon$


Weak convergence in $D[0,\infty)$ or in $C[0,\infty)$ basically gives information
on the locations and/or distribution of $X^\epsilon(\cdot)$ for small $\epsilon$, and for $t$
confined to some large, but still bounded, interval. See, e.g., the
discussion of the topology of these spaces in the Appendix. It is
important to have a convergence result which is valid uniformly in
(large) $t$ for small $\epsilon$, and such a result is readily available by appropriate
modifications of the previous results. One usually requires that the ODE
satisfied by the limit processes is stable, hence we assume

(C5.1) Let (3.1) (or (3.12) for the state dependent $(A_n,\xi_n)$ case) have a unique
stable (in the sense of Liapunov) point $\theta$ which is globally attracting.

Let $t_\epsilon \to \infty$ as $\epsilon \to 0$. Quite generally, if (C5.1) (and the conditions of
Theorems 3.1 or 3.2) holds, then $X^\epsilon(t_\epsilon + \cdot)$ converges weakly to a constant
process $X(\cdot)$, where $X(t) = (\theta, \ldots, \theta)$. This is precisely the desired asymptotic
result, since it says (roughly) that if the algorithm is 'stable' then, after a
fixed 'transient period' (independent of $\epsilon$), the $X^{i,\epsilon}(\cdot)$ are arbitrarily close to
$\theta$ in the sense of weak convergence.

Discussion of the Main Idea of the Development. Suppose that the set

(5.1)  $M = \{X^\epsilon(t),\ t \ge 0,\ \epsilon > 0\}$

is bounded in probability (tight); i.e., for each $\eta > 0$ there is a $k_\eta < \infty$ such
that $P\{|X^\epsilon(t)| \ge k_\eta\} \le \eta$, for all $\epsilon > 0$, $t \ge 0$. Then it is easy to show that
$X^\epsilon(t_\epsilon + \cdot) \Rightarrow X(\cdot)$. To see this, choose $T > 0$ and consider a convergent
subsequence of the pair of processes $\{X^\epsilon(t_\epsilon + \cdot), X^\epsilon(t_\epsilon - T + \cdot)\}$, with limit
denoted by $(X(\cdot), X_T(\cdot)) = (x(\cdot), \ldots, x(\cdot); x_T(\cdot), \ldots, x_T(\cdot))$ (recall that all the
$r$-vector components of the limits are equal). We have $X(0) = X_T(T)$. The
value of $X_T(0)$ is unknown, but all the possible such $X_T(0)$, over all $T$ and
convergent subsequences, belong to a tight set, with the same $\eta$ and $k_\eta$ as
above. By this and the stability condition (C5.1) and Theorem 3.1 (or
Theorem 3.2), for any $\delta > 0$ there is a $T_\delta < \infty$ such that for $T \ge T_\delta$, $X_T(T) =
(x_T(T), \ldots, x_T(T))$ will be in a $\delta$-neighborhood of $(\theta, \ldots, \theta)$ with probability
$\ge 1 - \delta$. This yields the desired conclusion, since it implies that $X(0) = (\theta, \ldots, \theta)$
w.p.1. Thus, to get the asymptotic (in $t$ and $\epsilon$) result, only the tightness of
(5.1) must be shown.

Next, consider the projection algorithm of Section 4 and assume (C5.1')
in lieu of (C5.1):

(C5.1') Let (4.4) have a unique stable (in the sense of Liapunov) point $\theta$
which is attracting in $L$.

Under (C5.1'), (5.1) is automatically bounded and if $t_\epsilon \to \infty$ as $\epsilon \to 0$ then
under the additional conditions of Section 4, $X^\epsilon(t_\epsilon + \cdot) \Rightarrow X(\cdot)$, where $X(t) =
(\theta, \ldots, \theta)$. Some form of projection algorithm is usually used in practical
algorithms, and so the tightness condition on (5.1) is not burdensome.

Sharper Bounds on the Asymptotic Errors $(X^i_n - \theta)$, for Large $\epsilon n$ and
Small $\epsilon$. Under additional 'stability' conditions, one can get order of
magnitude estimates for $(X^{i,\epsilon}(t) - \theta)$ for large $t$ and small $\epsilon$. We do one case
here in preparation for the rate of convergence work in the next section. We
will need:

(C5.2) There is a Liapunov function $V(\cdot)$ with $V(x) \to \infty$ as $|x| \to \infty$ and
$V(x) > 0$ for $x \ne \theta$, such that for some $\lambda > 0$ and $K < \infty$,
$V_x'(x)\bar\Phi\bar B(x) \le -\lambda V(x)$, $|V_x(x)|^2 \le K[V(x) + 1]$ and $V_{xx}(\cdot)$ is bounded.

Define

(5.2)  $V(X) = \sum_i V(X^i)$, for $X = (X^1, \ldots, X^q)$.

(C5.3) (C3.2), but where $B_0(X,\xi)$ and $B_1(X)$ are bounded and have
bounded and continuous $X$-derivatives (uniformly in $\xi$, for $B_0$).

(C5.4) There is a constant $K$ such that

$E\left|\sum_{k=v}^{v+m}E_v(\Phi_{k+1}B(X,\xi_k) - \Phi\bar B(X))\right|^2 \le K[V(X) + 1],$

for all positive $m$ and $v$. Similarly for the derivatives $B_x$ and $\bar B_x$
replacing $B$ and $\bar B$ respectively.

Remark. (C5.4) essentially implies a 'low' correlation between data in
the remote past and in the distant future. There is an analogous result to
Theorem 5.1 for the state dependent $\{A_n,\xi_n\}$ case, and for the constrained
case.

Theorem 5.1. Assume (C5.1) to (C5.4). There is an $N_\epsilon < \infty$ for each small
$\epsilon$ such that

$EV(X_n) = O(\epsilon), \quad n \ge N_\epsilon.$

Proof.  We  always  assume  n  *  n£  so  that  E|4>(n|0)  -  4>0|a  =  0(€2),  for  any 


(5.3) 
(5.4)  $X_{n+1} - X_n = [\Phi(n|0) - \Phi(n-1|0)]X_0 + \epsilon(\phi^\epsilon_n - \phi^\epsilon_{n-1}) + \epsilon\Phi\bar B(X_n) + \epsilon[\Phi_{n+1}B(X_n,\xi_n) - \Phi\bar B(X_n)]$

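and

(5.5)  $E_nV(X_{n+1}) - V(X_n) = V_x'(X_n)E_n[\Phi(n|0) - \Phi(n-1|0)]X_0 + \epsilon V_x'(X_n)E_n(\phi^\epsilon_n - \phi^\epsilon_{n-1})$
        $+\ \epsilon V_x'(X_n)\Phi\bar B(X_n) + \epsilon V_x'(X_n)E_n[\Phi_{n+1}B(X_n,\xi_n) - \Phi\bar B(X_n)] + \text{error term},$

where $E|\text{error term}| = O(\epsilon^2)$. By (C5.2) and $n \ge n_\epsilon$, the expectation of the
first term on the rhs of (5.5) is $O(\epsilon^2)(1 + EV(X_n))$. Write $\Phi_n$ in the
form

$\Phi_n = \begin{bmatrix}\bar\Phi_n \\ \vdots \\ \bar\Phi_n\end{bmatrix},$

where $\bar\Phi_n$ is an $r \times qr$ matrix. For $n \ge n_\epsilon$,

(5.6)  $|X^i_n - X^j_n| = O_n(\epsilon^2) + O(\epsilon)|\phi^\epsilon_n|,$

where $E|O_n(\epsilon^2)|^2 = O(\epsilon^4)$, uniformly in $n \ge n_\epsilon$.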

Using (5.6), rewrite the last two terms on the right side of (5.5) as,
respectively,

(5.7)  $\epsilon\sum_i V_x'(X^i_n)\bar\Phi\bar B(X^i_n) + \text{error term},$
       $\epsilon\sum_i V_x'(X^i_n)E_n[\bar\Phi_{n+1}B(X^i_n,\xi_n) - \bar\Phi\bar B(X^i_n)] + \text{error term},$

where by (C5.2) $E|\text{error term}| = O(\epsilon^2)(1 + EV(X_n))$.

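We now define the perturbations to the Liapunov function.
Define $V^\epsilon_1(n)$ by $V^\epsilon_1(n) = -\epsilon V_x'(X_n)\phi^\epsilon_{n-1}$. We have

(5.8)  $E|V^\epsilon_1(n)| = O(\epsilon)(1 + EV(X_n)),$

(5.9)  $EV^\epsilon_1(n+1) - EV^\epsilon_1(n) = -\epsilon EV_x'(X_n)(\phi^\epsilon_n - \phi^\epsilon_{n-1}) + O(\epsilon^2)E(1 + V(X_n)).$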


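Define $V^{i,\epsilon}_2(n)$ by

(5.10)  $V^{i,\epsilon}_2(n) = \epsilon\sum_{k=n}^{\infty}V_x'(X^i_n)E_n[\bar\Phi_{k+1}B(X^i_n,\xi_k) - \bar\Phi\bar B(X^i_n)].$

By (C5.2) and (C5.4),

(5.11)  $E|V^{i,\epsilon}_2(n)| = O(\epsilon)(1 + EV(X_n)).$

Also,

(5.12)  $E_nV^{i,\epsilon}_2(n+1) - V^{i,\epsilon}_2(n) = -\epsilon V_x'(X^i_n)E_n[\bar\Phi_{n+1}B(X^i_n,\xi_n) - \bar\Phi\bar B(X^i_n)] + \text{error term},$

where by (C5.4), $E|\text{error term}| = O(\epsilon^2)(1 + EV(X_n))$.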

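Now, define the perturbed Liapunov function $V^\epsilon(n) = V(X_n) + V^\epsilon_1(n) +
\sum_i V^{i,\epsilon}_2(n)$, and evaluate $E_nV^\epsilon(n+1) - V^\epsilon(n)$ and cancel the terms
$\pm\epsilon V_x'(X_n)E_n(\phi^\epsilon_n - \phi^\epsilon_{n-1})$ and $\pm\epsilon\sum_i V_x'(X^i_n)E_n[\bar\Phi_{n+1}B(X^i_n,\xi_n) - \bar\Phi\bar B(X^i_n)]$ to get

(5.13)  $E_nV^\epsilon(n+1) - V^\epsilon(n) = \epsilon\sum_i V_x'(X^i_n)\bar\Phi\bar B(X^i_n) + \text{error terms},$
        $E|\text{error term}| = O(\epsilon^2)(1 + EV(X_n)).$

Using (C5.2) and the bounds on $E|V^\epsilon_1(n)|$ and on $E|V^{i,\epsilon}_2(n)|$, we get

(5.14)  $EV^\epsilon(n+1) - EV^\epsilon(n) \le -\lambda\epsilon\sum_i EV(X^i_n) + O(\epsilon^2)(1 + EV(X_n))$
        $\le -\lambda\epsilon EV^\epsilon(n) + O(\epsilon^2)(1 + EV^\epsilon(n)).$

Hence, for small $\epsilon > 0$,

(5.15)  $EV^\epsilon(n) \le (1 - \lambda\epsilon/2)^{n-n_\epsilon}EV^\epsilon(n_\epsilon) + O(\epsilon).$

This, together with the bounds on $E|V^\epsilon_1(n)|$ and on $E|V^{i,\epsilon}_2(n)|$, yields the
Theorem. Q.E.D.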
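The step from a one-step inequality of the type (5.14) to an $O(\epsilon)$ bound can be checked numerically. The sketch below iterates the scalar recursion $v_{n+1} = (1 - \lambda\epsilon)v_n + K\epsilon^2(1 + v_n)$, treated as an equality for illustration; the constants `lam`, `K`, the starting value and the function names are assumptions. Its fixed point is $K\epsilon/(\lambda - K\epsilon)$, which is $O(\epsilon)$ for small $\epsilon$.

```python
def iterate_bound(eps, lam=1.0, K=2.0, v0=5.0):
    """Iterate v <- (1 - lam*eps) v + K eps^2 (1 + v) for about 20/(lam*eps) steps."""
    steps = int(20 / (lam * eps))
    v = v0
    for _ in range(steps):
        v = (1 - lam * eps) * v + K * eps ** 2 * (1 + v)
    return v

def fixed_point(eps, lam=1.0, K=2.0):
    """Limit of the recursion: K*eps / (lam - K*eps), valid when K*eps < lam."""
    return K * eps / (lam - K * eps)

limits = {eps: iterate_bound(eps) for eps in (0.1, 0.05, 0.01)}
```

After a transient of order $1/\epsilon$ steps the iterate is within rounding of the fixed point, and the ratio of the fixed point to $\epsilon$ stays bounded as $\epsilon$ decreases.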


6.  Rate  of  Convergence:  Qualitative  Asymptotic  Properties 

The Liapunov function in (C5.2) is often locally quadratic about $\theta$ in
the sense that $V(x) = x'Qx + O(|x|^3)$ for $Q > 0$. If this is true, then Theorem
5.1 implies that $\{(X^i_n - \theta)/\epsilon^{1/2},\ i \le q,\ n \ge N_\epsilon,\ \epsilon > 0\}$ is tight. In this section,
we will suppose that there are $N_\epsilon < \infty$ for each small $\epsilon > 0$ so that

(6.1)  $\{(X^i_n - \theta)/\sqrt\epsilon,\ i \le q,\ n \ge N_\epsilon,\ \epsilon > 0\}$ is tight, $Eb^i(\theta,\xi^i_n) = 0$.

Under (6.1), one can apply the methods of the 'centralized' case to get a
classical rate of convergence result.

Much  information  concerning  the  asymptotic  behavior  and  comparison 
with  other  algorithms  can  be  obtained  from  such  a  result.  The  method  and 
results  will  be  discussed  in  an  informal  way  so  that  the  main  ideas  are  clear. 
Despite  the  informality,  the  conditions  needed  for  the  proof  will  generally 
be  stated.  The  proofs  follow  standard  lines  in  weak  convergence  theory,  and 
are  not  hard.  Our  aim  is  to  exhibit  the  asymptotic  behavior  of  the  suitably 
normalized  errors,  then  specialize  them  to  simple  cases  where  a  comparison 
can  be  made  with  ‘centralized’  forms  of  the  algorithm,  so  that  one  can  see 
the  effects  and  value  of  the  decentralization,  and  evaluate  alternative  forms 
of  communication  and  algorithms.  The  discussion  is  continued  in  the  next 
section. Such insights are needed at this stage of development of the
'decentralized' algorithms, as a guide to future developments, and are
perhaps more important than a rigorous development along the standard
lines. We will use the assumption (6.1), the boundedness of $B(\cdot,\cdot)$ in each
bounded $X$-set and that $B(\cdot,\xi)$ has a continuous (uniformly in $\xi$) derivative,
and  EB(X,0  =  0. 

For any $R^s$ valued function $p(\cdot) = (p^1(\cdot), \ldots)$ of $x$ (or $X$), let $(p(\theta))_x$
denote the (Jacobian) matrix whose $i$th row is the $x$ (respectively, $X$) gradient
of $p^i(\cdot)$. Recall the definitions of $\Phi$ and $\bar\Phi$ (above (C3.5)), and of $\bar\Phi_n$ (in
Theorem 5.1). Define the matrix $M = (\bar\Phi\bar B(\theta))_x$ and suppose that it is stable.

Let 


m  (♦k+1B(0.(k)),  -  <*  B(e))x  -  M 


in probability as $n \to \infty$ and $m \to \infty$.

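Define $U^\epsilon_n = (X_n - \Theta)/\sqrt\epsilon$, where $\Theta = (\theta, \theta, \ldots, \theta)$. Recall the definition of
$n_\epsilon$ given below (C3.4') and that $n_\epsilon$ can be chosen such that $\sqrt\epsilon\,n_\epsilon \to 0$. Given
$N > 0$, we have, for $n \ge n_\epsilon + N$,

(6.2)  $U^\epsilon_{n+1} = \Phi(n|N)U^\epsilon_N + \sqrt\epsilon\sum_{k=N}^{N+n_\epsilon}\Phi(n|k+1)B(X_k,\xi_k) + \sqrt\epsilon\sum_{k=N+n_\epsilon+1}^{n}\Phi_{k+1}B(X_k,\xi_k)$
       $+\ \sqrt\epsilon\sum_{k=N+n_\epsilon+1}^{n}[\Phi(n|k+1) - \Phi_{k+1}]B(X_k,\xi_k).$

Define (for $n \ge N + n_\epsilon$)

$W^\epsilon_n = \sqrt\epsilon\sum_{k=N+n_\epsilon+1}^{n}\Phi_{k+1}B(\theta,\xi_k).$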
N+n^ +1 
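Let $N \ge N_\epsilon$. For $t \ge 0$, define the process $U^\epsilon(\cdot)$ by $U^\epsilon(t) = U^\epsilon_n$ for $t \in
[\epsilon(n-N-n_\epsilon), \epsilon(n-N-n_\epsilon+1))$ and define $W^\epsilon(\cdot)$ similarly from $\{W^\epsilon_n\}$. By Taylor's
Theorem and the definition of $n_\epsilon$,

(6.3)  $U^\epsilon_{n+1} = \Phi_NU^\epsilon_N + O_n(\epsilon^2)U^\epsilon_N + O_n(n_\epsilon\sqrt\epsilon) + O(\sqrt\epsilon)\phi^\epsilon_n$
       $+\ \epsilon\sum_{k=N+n_\epsilon+1}^{n}(\Phi_{k+1}B(\theta,\xi_k))_XU^\epsilon_k + W^\epsilon_n + O(\epsilon^{3/2})\sum_{k=N+n_\epsilon+1}^{n}O(|U^\epsilon_k|^2),$

where $E|O_n(\epsilon^2)|^2 = O(\epsilon^4)$ since $n \ge n_\epsilon$, and $E|O_n(n_\epsilon\sqrt\epsilon)| = O(n_\epsilon\sqrt\epsilon)$. Also
$(\Phi_{k+1}B(\theta,\xi_k))_X$ denotes the matrix whose rows are the $X$-gradients of the
components of $\Phi_{k+1}B(X,\xi_k)$ evaluated at $X = (\theta, \theta, \ldots)$.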

In order to study the weak convergence of $U^\epsilon(\cdot)$, we can truncate the
dynamics (as in Theorem 3.1) if $\{U^\epsilon_k\}$ is not bounded: wherever $U^\epsilon_k$ appears
in (6.3), we simply replace it by $U^\epsilon_kq_m(U^\epsilon_k)$, where $q_m(u) = 1$ for $|u| \le m$, and
is a smooth function with compact support. We get the weak convergence
with use of $q_m$, and then let $m \to \infty$. The uniqueness of the solution to the
limit equation (6.9) below guarantees that the procedure works. For
notational simplicity, we simply suppose that $\{U^\epsilon_k\}$ is bounded. Suppose
that $\{W^\epsilon(\cdot)\}$ is tight and has continuous limits. Then, this also holds for
$\{U^\epsilon(\cdot)\}$. Also, the second, third, fourth and last terms on the right side of
(6.3) disappear in the limit. The limit of any convergent subsequence
satisfies

(6.4)  $U(t) = U(0) + \int_0^t(\Phi\bar B(\theta))_XU(s)\,ds + W(t),$

where $W(\cdot)$ is the limit of $\{W^\epsilon(\cdot)\}$.

The Limits of $\{W^\epsilon(\cdot)\}$. Under broad conditions $W(\cdot)$ is a Wiener process
with covariance

(6.5)  $t\,E\,\Phi_{k+1}B(\theta,\xi_k)B'(\theta,\xi_k)\Phi'_{k+1},$

where the expectation in (6.5) is to be interpreted in the ergodic sense:

$\lim_{m,n}\frac{1}{m}\sum_{k=n}^{n+m-1}E\,\Phi_{k+1}B(\theta,\xi_k)B'(\theta,\xi_k)\Phi'_{k+1}.$

We now give some conditions under which $W(\cdot)$ is the asserted Wiener
process. Let

(6.6)  $E\left|\sum_{k=n}^{n+m-1}\Phi_{k+1}B(\theta,\xi_k)\right|^4 \le \text{Constant}\cdot m^2.$

Then $\{W^\epsilon(\cdot)\}$ is tight and all limits are continuous [6]. If

(6.7)  $\sum_{k=n}^{n+m-1}\Phi_{k+1}B(\theta,\xi_k)/\sqrt m$

converges in distribution to a normal random variable (with mean zero) as
$n \to \infty$ and $m \to \infty$, then $W(\cdot)$ is a Gaussian process. If, for $t_1 \le t_2 \le t_3 \le t_4$,

(6.8)  $E[W^\epsilon(t_4) - W^\epsilon(t_3)][W^\epsilon(t_2) - W^\epsilon(t_1)]' \to 0,$

then the increments of the limit $W(\cdot)$ are orthogonal and the limit is a
(nonstandard) Wiener process. The proofs follow standard lines in weak
convergence theory [6]. The properties (6.6), (6.8) hold if the $\{A_k\}$ is
independent of the $\{\xi_k\}$ and the dependence among the $\xi_k$ decreases fast
enough as the time difference increases. Henceforth we assume that $W(\cdot)$ is
the zero mean Wiener process with covariance (6.5).
For the same reasons that the $X(\cdot)$ of Section 3 took the form $X(\cdot) =
(x(\cdot), \ldots, x(\cdot))$ for $x(t) \in E^r$, we have $U(\cdot) = (u(\cdot), \ldots, u(\cdot))$ and $W(\cdot) =
(w(\cdot), \ldots, w(\cdot))$. Then (6.4) reduces to

(6.9)  $du = Mu\,dt + dw.$

The covariance of $w(1)$ can be obtained from (6.5): Writing

$\Phi_k = \begin{bmatrix}\phi_1(k) & \cdots & \phi_q(k) \\ \vdots & & \vdots \\ \phi_1(k) & \cdots & \phi_q(k)\end{bmatrix},$

where the $\phi_i(k)$ are diagonal, (6.5) reduces to

(6.10)  $\mathrm{cov}\,w(1) = R \equiv E\left[\sum_i\phi_i(k+1)b^i(\theta,\xi^i_k)\right]\left[\sum_i\phi_i(k+1)b^i(\theta,\xi^i_k)\right]'.$

If $N \to \infty$ fast enough as $\epsilon \to 0$, then the limit $u(\cdot)$ is the stationary solution to
(6.9).

The stationary covariance

(6.11)  $\int_0^\infty e^{Mt}Re^{M't}\,dt$

of (6.9) is a standard measure of the 'rate of convergence' or asymptotic
quality of the algorithm, and can be used as a basis of comparison among
alternative algorithms.
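For a stable $M$, the integral (6.11) is the equilibrium of the covariance equation $\dot\Sigma = M\Sigma + \Sigma M' + R$ associated with (6.9), i.e., it solves $M\Sigma + \Sigma M' + R = 0$. The following sketch computes it by simple Euler iteration of that equation in pure Python. The step size, the iteration count, the test matrices and the function names are illustrative assumptions; a production code would use a dedicated Lyapunov solver.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def lyapunov_rhs(m, sigma, r):
    """M Sigma + Sigma M' + R, the right side of the covariance equation."""
    ms = matmul(m, sigma)
    smt = matmul(sigma, transpose(m))
    n = len(m)
    return [[ms[i][j] + smt[i][j] + r[i][j] for j in range(n)] for i in range(n)]

def stationary_cov(m, r, h=0.002, steps=20000):
    """Euler iteration of dSigma/dt = M Sigma + Sigma M' + R from Sigma = 0."""
    n = len(m)
    sigma = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        d = lyapunov_rhs(m, sigma, r)
        sigma = [[sigma[i][j] + h * d[i][j] for j in range(n)] for i in range(n)]
    return sigma

# Scalar check: M = -2, R = 3 gives R / (2|M|) = 0.75.
scalar = stationary_cov([[-2.0]], [[3.0]])
# A 2x2 stable example (hypothetical numbers).
M2 = [[-1.0, 0.5], [0.0, -2.0]]
R2 = [[1.0, 0.0], [0.0, 1.0]]
sigma2 = stationary_cov(M2, R2)
residual = lyapunov_rhs(M2, sigma2, R2)
```

In the scalar case this reduces to $R/2|M|$, the form of the stationary variance used for the scalar system later in this section.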

A Special Case. We specialize to a simple case in order to get some
insight into the asymptotic behavior. Let $\{\xi_k\}$ be independent of $\{A_k\}$ with
$\{\xi^i_k,\ i \le q,\ k = 1,2,\ldots\}$ mutually independent with $\mathrm{cov}\,b^i(\theta,\xi^i_k) = R_i$. Then

(6.12)  $\mathrm{cov}\,w(1) = \lim\frac{1}{m}\sum_{i=1}^{q}\sum_{k=n}^{n+m}E\,\phi_i(k)R_i\phi_i(k).$

A Scalar System. Let $r = 1$. Then $\phi_i(k)$ and $R_i$ are scalars and
$\sum_{i=1}^{q}\phi_i(k) = 1$.
Let $b^i(\cdot,\cdot) = b(\cdot,\cdot)$ and $R_i = R$ not depend on $i$. Then (6.9) becomes ($b_x(\theta) < 0$)

(6.13)  $du = b_x(\theta)u\,dt + \sigma_D\,dw,$

where $w(\cdot)$ is a standard Wiener process and (where by the expectation $E$, we
mean the ergodic mean in (6.12))

$\sigma^2_D = R\,E\sum_i\phi^2_i(n).$

The stationary variance of $u(\cdot)$ is $\sigma^2_D/2|b_x(\theta)| \equiv \mathrm{var}_D$.

Comparison with a 'Centralized' Algorithm. Define the following
centralized algorithm, under the scalar system assumptions of the above
paragraph:

(6.14)  $Z_{n+1} = Z_n + \epsilon b(Z_n,\xi_n), \quad \{\xi_n,\ n = 1,2,\ldots\}$ i.i.d.

Define $V_n = (Z_n - \theta)/\sqrt\epsilon$ and define $v^\epsilon(\cdot)$ by $v^\epsilon(t) = V_n$ on $[n\epsilon, n\epsilon + \epsilon)$. If $t_\epsilon
\to \infty$ fast enough as $\epsilon \to 0$, then under appropriate conditions [8] $v^\epsilon(t_\epsilon + \cdot) \Rightarrow
v(\cdot)$, where

(6.15)  $dv = b_x(\theta)v\,dt + \sqrt R\,dw.$

7.  Asymptotic  Properties:  Discussion  and  Comparison 


Independent $\{A_n\}$. We evaluate $\mathrm{var}_D/\mathrm{var}_C$ under the conditions of the
last subsection of Section 6, where $q = 2$ and the $\{A_n\}$ are i.i.d. In
particular, let $c \in [0,1)$, and let the processors act independently, with $p$
= probability that $i$ communicates to $j \ne i$ at time $n$. With no
communication (probability $(1-p)^2$), $A_n = I$; if 2 communicates to 1,
but not conversely (probability $p(1-p)$), then


If  1  communicates  to  2  (but  not  conversely),  then 


If  both  communicate  to  each  other,  then 


Refer  to  Table  1.  The  optimum  value  of  the  ratio  of  the  variances  is 
unity,  a  value  closely  approximated  by  small  c.  Clearly  a  larger  p  is 
desirable.  As  c  -*  0,  the  ratio  improves  --  but  the  size  of  the  4>*  would 
increase.  This  implies  that  one  must  wait  longer  for  the  stationary 
variance  to  be  a  good  indicator  of  the  actual  performance  (the  effects  of 
the  communication  are  realized  more  slowly).  Similarly,  for  small  p.  But, 
in  all  cases,  the  average  performance  is  much  better  than  that  for  the 


centralized  algorithm  (6.14). 
Table 1. Values of $2\,\mathrm{var}_D/\mathrm{var}_C = \mathrm{var}_D/\mathrm{var}_{2C}$.

    p \ c             .25      .5
    .1      1.036    1.13     1.312
    .3      1.016    1.10     1.26
    .7      1.008    1.04     1.13
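The entries of Table 1 can be checked with a short computation. The sketch below assumes $q = 2$, $r = 1$, and that a processor receiving a value replaces its own value $x_i$ by $(1-c)x_i + cx_j$; this form of the mixing matrices is an assumption made here for illustration, as are the function names. With $s = \phi_1 - 1/2$, each communication event acts on $s$ as an affine map, so the stationary second moment of $s$ solves a linear equation and $2\,\mathrm{var}_D/\mathrm{var}_C = 2E\sum_i\phi_i^2 = 1 + 4Es^2$.

```python
def ratio_iid(p, c):
    """2*var_D/var_C for i.i.d. A_n under the assumed mixing matrices."""
    # (probability, alpha, beta): the event maps s to alpha*s + beta.
    events = [
        ((1 - p) ** 2, 1.0, 0.0),        # no communication, A = I
        (p * (1 - p), 1 - c, -c / 2),    # 2 communicates to 1 only
        (p * (1 - p), 1 - c, c / 2),     # 1 communicates to 2 only
        (p ** 2, 1 - 2 * c, 0.0),        # both communicate
    ]
    a1 = sum(q * a for q, a, b in events)
    b1 = sum(q * b for q, a, b in events)
    mean = b1 / (1 - a1)                 # stationary mean of s (zero by symmetry)
    a2 = sum(q * a * a for q, a, b in events)
    ab = sum(q * a * b for q, a, b in events)
    b2 = sum(q * b * b for q, a, b in events)
    second = (2 * ab * mean + b2) / (1 - a2)  # stationary E s^2
    return 1 + 4 * second

def ratio_closed_form(p, c):
    """Closed form of the same quantity under the same assumptions."""
    return 1 + (1 - p) * c / (2 - c * (1 + p))

table = {(p, c): ratio_iid(p, c) for p in (0.1, 0.3, 0.7) for c in (0.25, 0.5)}
```

Under these assumptions the computed values agree, to the printed accuracy or within one or two units in the last printed digit, with the $c = .25$ and $c = .5$ columns of Table 1 (for example, about 1.13 for $p = .1$, $c = .25$), and the ratio tends to unity as $c \to 0$ or $p \to 1$.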


A Deterministic Communication Scheme. We retain the assumptions of
the last subsection, except for those on the communications. Let $m$ and $m_1$
be integers with $m_1 \le m/2$. Processor 2 communicates to 1 each $m$ units of
time, and 1 communicates to 2 $m_1$ units of time later. We use $A_{12}$, $A_{21}$
(when $m_1 \ne 0$) and $A_0$ (when $m_1 = 0$). For $m_1 = 0$, $(2\,\mathrm{var}_D/\mathrm{var}_C) = 1$, for all
$0 < c < 1$. For $m_1 \ne 0$, we have Table 2.


Table 2. $2\,\mathrm{var}_D/\mathrm{var}_C = \mathrm{var}_D/\mathrm{var}_{2C}$.

    c       2 var_D/var_C
    .1      1.0028
    .3      1.03
    .5      1.11
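Table 2 can be checked in the same way. The sketch assumes $q = 2$, $r = 1$, that a received value gets weight $c$, and that the two one-way communications alternate (the case $m_1 \ne 0$); between communications $A_n = I$, so the weights do not change. These matrix forms and the function names are assumptions made for illustration. With $s = \phi_1 - 1/2$, the two events map $s$ to $(1-c)s \mp c/2$, and $2\,\mathrm{var}_D/\mathrm{var}_C = 1 + 4s^2$ averaged over the periodic limit.

```python
def ratio_alternating(c, sweeps=200):
    """2*var_D/var_C when the two one-way communications alternate."""
    s = 0.0
    s_a = s_b = 0.0
    for _ in range(sweeps):
        s = (1 - c) * s - c / 2   # 2 communicates to 1
        s_a = s
        s = (1 - c) * s + c / 2   # 1 communicates to 2
        s_b = s
    # s*s takes the same value in both phases, so time weights do not matter.
    return 1 + 4 * 0.5 * (s_a ** 2 + s_b ** 2)

def ratio_alternating_closed(c):
    """Periodic limit s = -/+ c / (2 (2 - c)) gives 1 + c^2 / (2 - c)^2."""
    return 1 + c ** 2 / (2 - c) ** 2

values = {c: ratio_alternating(c) for c in (0.1, 0.3, 0.5)}
```

Under these assumptions the result does not depend on $m$ or $m_1$ (for $m_1 \ne 0$), and the three values round to the entries of Table 2.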

The  values  of  m  and  mx  appear  only  in  the  values  of  <p which  increases  as 
$m$ and $m_1$ increase. The values of $\mathrm{var}_D/\mathrm{var}_{2C}$ are substantially worse when
processor 1 communicates to processor 2 more often than the reverse
communication rate, for deterministic communication times. This suggests
that a relatively balanced communication strategy is better and that a
processor should 'respond' as soon as possible after it receives a 'message'
from another processor.

Discussion. It is clear that the decentralized algorithm takes advantage of the possibilities of parallel processing, since its variance is better than that of the classical algorithm (6.14), and can be nearly as good as that of the fully centralized algorithm (6.16). But there is another advantage, which can be considerable. Simulations with recursive algorithms such as (6.14) indicate that a key problem concerns the frequently slow recovery from the effects of large 'bursts' of noise; i.e., from a large 'random' jump in the state value. This effect would not show up in the asymptotic variance estimates, but is of considerable importance in practice, particularly when the algorithm is not in operation for a very long time. The nature of the 'convexification' should often reduce the magnitude of this problem, and 'robustify' the process. In a sense, the decentralized algorithm would perform much better than the worst of $q$ identical (but not communicating) processors, and (in a tracking system, for example) would reduce the chances of any one processor losing track. In applications to optimization or systems evaluation by Monte Carlo simulation, one can use 'variance reduction' ideas in choosing appropriate correlations among the sets $\{\xi_n^i,\ n = 1, 2, \ldots\}$, $i \le q$. Hopefully this, together with the above 'robustifying' property, would yield good behavior.
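
The 'robustifying' effect can be illustrated with a small noise-free sketch. Everything specific in it is an assumption made for illustration and is not taken from the text: two scalar processors, dynamics b(x) = -(x - theta), a symmetric convexification with weight `c` applied every `comm_every` steps, and the helper name `steps_to_recover`. Processor 1 starts with a large jump, and we count the steps until it is back within a tolerance of theta, with and without communication.

```python
def steps_to_recover(jump, eps=0.01, c=0.5, comm_every=10, tol=0.1,
                     theta=0.0, communicate=True, max_steps=100_000):
    """Count steps until processor 1 is within `tol` of `theta`.

    Hypothetical illustration: q = 2 scalar processors, b(x) = -(x - theta),
    no observation noise, and symmetric convexification every `comm_every`
    steps (all of these are assumptions, not the paper's setup).
    """
    x1, x2 = theta + jump, theta  # processor 1 has just suffered a burst
    for n in range(1, max_steps + 1):
        d1 = -eps * (x1 - theta)  # eps * b(x1)
        d2 = -eps * (x2 - theta)  # eps * b(x2)
        if communicate and n % comm_every == 0:
            # X_{n+1} = A_n X_n + eps * b(X_n), with A_n mixing the two states
            x1, x2 = (1 - c) * x1 + c * x2 + d1, c * x1 + (1 - c) * x2 + d2
        else:
            x1, x2 = x1 + d1, x2 + d2
        if abs(x1 - theta) < tol:
            return n
    return max_steps


alone = steps_to_recover(10.0, communicate=False)
shared = steps_to_recover(10.0, communicate=True)
print(f"steps to recover without communication: {alone}")
print(f"steps to recover with communication:    {shared}")
```

With these settings the communicating pair needs fewer steps, because the first convexification roughly halves the deviation of the disturbed processor (at the cost of perturbing the other one). The sketch says nothing about asymptotic variance; it only illustrates the transient effect discussed above.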

An Example. The following is an example which opens up many new possibilities. Consider two receivers (say, digital phase locked loops), each receiving a signal from the same source, but the two being physically separated. Each must estimate the phase or epoch of the signal pulse (and perhaps the phase of the carrier). Suppose that the source is much farther from the receivers than they are from each other, so that more reliable communication between the receivers is possible. It might be possible to improve each other's estimates by occasional communications. This communication would transfer the estimates, as well as allow the receivers to improve the mutual synchronization of their clocks or oscillators, so that the transferred estimates can be meaningfully used.

Communication Noise. In examples such as the preceding, one would normally have communication noise. This is readily incorporated into the analysis. Write (1.1) as

$$X_{n+1} = (A_n X_n + \delta_n) + \epsilon\, b(X_n, \xi_n), \tag{7.1}$$

where $\delta_n$ represents the communication noise. For the algorithm to be useful at all, this noise should be of an order no larger than $\epsilon$. Then write $\delta_n = \epsilon \tilde\delta_n$, and proceed as before.

Even if $\delta_n = O(\sqrt{\epsilon})$ and $E\delta_n = 0$, useful results can be obtained. If the interpolation of the partial sums $\sum_{i<n} \delta_i$ converges weakly to a Wiener process $W(\cdot)$, then we might have $X^\epsilon(\cdot) \Rightarrow X(\cdot)$:

$$dX = B(X)\,dt + dW.$$

Again $X(\cdot)$ takes the form $(x(\cdot), x(\cdot))$, under appropriate conditions.
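
A rough scaling check shows why noise of order $\sqrt{\epsilon}$ is the critical size. Assume, purely for illustration (the text does not state this), that the $\delta_n$ are independent with mean zero and covariance $\epsilon\Sigma$. Over an interpolated time interval $[0, t]$ there are about $t/\epsilon$ iterations, so

```latex
\[
\operatorname{cov}\Bigl(\sum_{i=0}^{\lfloor t/\epsilon\rfloor - 1}\delta_i\Bigr)
  = \lfloor t/\epsilon\rfloor\,\epsilon\,\Sigma
  \;\longrightarrow\; t\,\Sigma
  \qquad\text{as } \epsilon \to 0,
\]
```

which is the covariance at time $t$ of a Wiener process with $\operatorname{cov} W(1) = \Sigma$. Under the same assumption, zero-mean noise of order $\epsilon$ would give an accumulated covariance of order $\epsilon t$, which vanishes in the limit.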


An Alternative Algorithm. To get additional insight into the behavior of the decentralized algorithm, we formally compare (1.1) with a reasonable alternative. Suppose that the processors communicate and 'convexify' only the changes in the states since the last communication. In particular, let $q = 2$ and let $\{\tau_n^i\}$, $i = 1, 2$, denote the communication times of the two processors, with $|\tau_{n+1}^i - \tau_n^i|$ bounded. Here processor 2 communicates to processor 1 at $\{\tau_k^1\}$, and similarly for the reverse communication. We proceed purely formally, and suppose that the dynamics are smooth and bounded. Define $\{\tilde X_n^i\}$ by $\tilde X^i_{\tau^i_k} = X^i_{\tau^i_k}$ and

$$\tilde X^i_{n+1} = \tilde X^i_n + \epsilon\, b^i(\tilde X^i_n, \xi^i_n), \qquad \tau^i_k \le n < \tau^i_{k+1}. \tag{7.2}$$
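For $\alpha \in (0, 1/2)$, set

$$\begin{aligned}
X^1_{\tau^1_{k+1}} &= X^1_{\tau^1_k} + (1-\alpha)\,\epsilon \sum_{j=\tau^1_k}^{\tau^1_{k+1}-1} b^1(\tilde X^1_j, \xi^1_j) + \alpha\,\epsilon \sum_{j=\tau^1_k}^{\tau^1_{k+1}-1} b^2(\tilde X^2_j, \xi^2_j), \\
X^2_{\tau^2_{k+1}} &= X^2_{\tau^2_k} + \alpha\,\epsilon \sum_{j=\tau^2_k}^{\tau^2_{k+1}-1} b^1(\tilde X^1_j, \xi^1_j) + (1-\alpha)\,\epsilon \sum_{j=\tau^2_k}^{\tau^2_{k+1}-1} b^2(\tilde X^2_j, \xi^2_j).
\end{aligned} \tag{7.3}$$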
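
The update (7.2), (7.3) can be sketched as follows. This is a minimal illustration under assumptions that are not in the text: scalar states, both processors communicating at the same deterministic times (every `m` steps), dynamics b(x, xi) = -(x - theta) + xi, and the hypothetical name `run_alternative`. Between communications each processor runs freely; at a communication time only the accumulated changes are convexified.

```python
import random


def run_alternative(alpha, x_start, n_comm, m=10, eps=0.01, theta=1.0,
                    noise_sd=0.0, seed=0):
    """Alternative algorithm for q = 2: convexify only the changes.

    Assumptions (for illustration only): scalar states, common communication
    times every m steps, b(x, xi) = -(x - theta) + xi with Gaussian xi.
    Returns the pair of states after n_comm communications.
    """
    rng = random.Random(seed)
    x = list(x_start)
    for _ in range(n_comm):
        free = list(x)  # free-running copies, restarted at each communication
        for _ in range(m):
            for i in range(2):
                xi = rng.gauss(0.0, noise_sd)
                free[i] += eps * (-(free[i] - theta) + xi)
        change = [free[i] - x[i] for i in range(2)]  # eps * sum of b terms
        x = [x[0] + (1 - alpha) * change[0] + alpha * change[1],
             x[1] + alpha * change[0] + (1 - alpha) * change[1]]
    return x


# Noise-free runs from a disagreeing start (0, 2) around theta = 1.
x_a = run_alternative(alpha=0.25, x_start=(0.0, 2.0), n_comm=500)
x_b = run_alternative(alpha=0.5, x_start=(0.0, 2.0), n_comm=500)
print("alpha = 0.25:", x_a)
print("alpha = 0.50:", x_b)
```

In this noise-free setting the pair with alpha = 0.25 reaches agreement at theta, while at the boundary value alpha = 1/2 the initial disagreement is never reduced, because both processors then apply identical increments. Communicating only changes therefore gives no direct pull of one state toward the other.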


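Owing to the smoothness and boundedness assumptions, there are $O_i(\epsilon^2) = O(\epsilon^2)$ and a process $\hat X_n = (\hat X^1_n, \hat X^2_n)$ satisfying (7.4) and which equals (modulo $O(\epsilon^2)$) $(\tilde X^1_n, \tilde X^2_n)$ and $(X^1_{\tau^1_k}, X^2_{\tau^2_k})$ (at the communication times):

$$\begin{aligned}
\hat X^1_{n+1} &= \hat X^1_n + (1-\alpha)\,\epsilon\, b^1(\hat X^1_n, \xi^1_n) + \alpha\,\epsilon\, b^2(\hat X^2_n, \xi^2_n) + O_1(\epsilon^2), \\
\hat X^2_{n+1} &= \hat X^2_n + \alpha\,\epsilon\, b^1(\hat X^1_n, \xi^1_n) + (1-\alpha)\,\epsilon\, b^2(\hat X^2_n, \xi^2_n) + O_2(\epsilon^2).
\end{aligned} \tag{7.4}$$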


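The 'size' of the $O_i$ depends on the bound on $|\tau^i_{n+1} - \tau^i_n|$. From this point on, one can use standard theory for the centralized case to get both the ODE and the asymptotic normalized variance. Define $\hat X^\epsilon(\cdot)$ as $X^\epsilon(\cdot)$ was defined, and similarly for $\hat U^\epsilon(\cdot)$. The limit ODE is

$$\dot X = \begin{pmatrix} (1-\alpha)\,\bar b^1(X^1) + \alpha\,\bar b^2(X^2) \\ \alpha\,\bar b^1(X^1) + (1-\alpha)\,\bar b^2(X^2) \end{pmatrix} = B(X^1, X^2). \tag{7.5}$$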


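The limit $U(\cdot)$ of $\{\hat U^\epsilon(\cdot)\}$ satisfies

$$dU = MU\,dt + dW, \tag{7.6}$$

where

$$\operatorname{cov} W(1) = E\,JJ', \qquad J = \begin{pmatrix} (1-\alpha)\,b^1(\theta, \xi^1) + \alpha\,b^2(\theta, \xi^2) \\ \alpha\,b^1(\theta, \xi^1) + (1-\alpha)\,b^2(\theta, \xi^2) \end{pmatrix}, \qquad M = B_X(\theta, \theta),$$

and we suppose that $M$ is a stable matrix.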

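Comparison of the Alternative (7.2), (7.3) with (1.1). We use the special scalar case of the first subsection of this section, where the $\{\xi^i_n\}$ are i.i.d. and $b^1(\cdot,\cdot) = b^2(\cdot,\cdot) = b(\cdot,\cdot)$. Then (again $\bar b_x(\theta) < 0$)

$$\dot X = \begin{pmatrix} (1-\alpha)\,\bar b(X^1) + \alpha\,\bar b(X^2) \\ \alpha\,\bar b(X^1) + (1-\alpha)\,\bar b(X^2) \end{pmatrix},$$

$$dU = \bar b_x(\theta) \begin{pmatrix} 1-\alpha & \alpha \\ \alpha & 1-\alpha \end{pmatrix} U\,dt + dW = MU\,dt + dW,$$

$$\operatorname{cov} W(1) = E\,[b(\theta, \xi)]^2 \begin{pmatrix} (1-\alpha)^2 + \alpha^2 & 2\alpha(1-\alpha) \\ 2\alpha(1-\alpha) & (1-\alpha)^2 + \alpha^2 \end{pmatrix}.$$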

Let $\mathrm{var}_{D2}$ denote the stationary variance of $U(\cdot)$. As $\alpha \uparrow 1/2$, this converges to the infimal value, equal to $\mathrm{var}_{2C}$. But at $\alpha = 1/2$, the matrix $M$ is singular. Thus, again, there seems to be a trade-off between the 'minimal asymptotic variance' and the length of time one must wait for the asymptotic estimates to be valid or, similarly, for the communication to be effective. At this point, the alternative algorithm does not seem to offer any clear advantages. It was investigated simply because of the idea that there might be an advantage in communicating only recent data.
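
The trade-off can be checked numerically in the scalar symmetric case. The sketch below assumes M = b_x(theta) * [[1-a, a], [a, 1-a]] and cov W(1) = sigma2 * [[p, r], [r, p]] with p = (1-a)^2 + a^2, r = 2a(1-a), sigma2 = E b(theta, xi)^2, and solves the stationary Lyapunov equation M S + S M' + cov W(1) = 0 for the covariance S of U. The function name, the values b_x(theta) = -1 and sigma2 = 1, and the identification of var_2C with sigma2 / (4 |b_x(theta)|) are assumptions made for this illustration.

```python
def stationary_variance(alpha, bx=-1.0, sigma2=1.0):
    """Per-component stationary variance of dU = M U dt + dW (2 x 2 case).

    M = bx * [[1-a, a], [a, 1-a]] and cov W(1) = sigma2 * [[p, r], [r, p]]
    with p = (1-a)^2 + a^2, r = 2a(1-a). Both matrices are symmetric and
    commute, so the Lyapunov equation M S + S M' + cov W(1) = 0 reduces to
    2 M S = -cov W(1), with S = [[s, t], [t, s]].
    """
    m_diag, m_off = bx * (1 - alpha), bx * alpha
    p = sigma2 * ((1 - alpha) ** 2 + alpha ** 2)
    r = sigma2 * 2 * alpha * (1 - alpha)
    det = m_diag ** 2 - m_off ** 2  # det M = bx^2 (1 - 2a); zero at a = 1/2
    if abs(det) < 1e-12:
        raise ValueError("M is singular: no stationary covariance")
    # Solve m_diag*s + m_off*t = -p/2 and m_off*s + m_diag*t = -r/2 (Cramer).
    s = (-p / 2 * m_diag + r / 2 * m_off) / det
    t = (-r / 2 * m_diag + p / 2 * m_off) / det
    slow_rate = abs(bx) * (1 - 2 * alpha)  # decay rate of the mode X^1 - X^2
    return s, t, slow_rate


var_2c = 1.0 / 4.0  # assumed centralized value sigma2 / (4 |bx|)
for alpha in (0.0, 0.25, 0.45, 0.49):
    s, t, rate = stationary_variance(alpha)
    print(f"alpha={alpha:.2f}  var/var_2C={s / var_2c:.3f}  slow rate={rate:.2f}")
```

Under these assumptions the per-component variance works out to sigma2 (1 - alpha) / (2 |b_x(theta)|), so the ratio to the assumed var_2C is 2(1 - alpha) and decreases to 1 as alpha increases to 1/2. At the same time the decay rate |b_x(theta)| (1 - 2 alpha) of the difference mode goes to zero, which is the slow approach to the asymptotic regime described above.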
8.  Stochastic Approximation With $\epsilon_n \to 0$

The entire development can be repeated if $\epsilon$ is replaced by $0 < \epsilon_n \to 0$, $\sum_n \epsilon_n = \infty$. One then gets results of classical stochastic approximation type, and we only make a few formal comments. We use $X_{n+1} = A_n X_n + \epsilon_n b(X_n, \xi_n)$. Define $t_n = \sum_{i=0}^{n-1} \epsilon_i$ and (for $t \ge 0$) define $X^n(\cdot)$ by $X^n(t) = X_{n+i}$ for $t \in [t_{n+i} - t_n,\ t_{n+i+1} - t_n)$. Under the conditions of Theorem 5.1, $\limsup_n E V(X_n) < \infty$. Given either this or the use of the projection algorithm of Section 4, one can get the appropriate ODE which characterizes the limit paths. If this has the appropriate stability properties (as in Section 5), we can show that $X^{n,i}(\cdot) \Rightarrow x(\cdot) = \theta$. The ODE is the same as that in the previous sections, for all the same cases.
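
The time scale and the shifted interpolation can be computed directly. The following sketch assumes $t_n = \sum_{i<n} \epsilon_i$ and $X^n(s) = X_{n+i}$ on $[t_{n+i} - t_n,\ t_{n+i+1} - t_n)$; the gain sequence 1/(k+1), the placeholder iterates, and the names `time_scale` and `shifted_interpolation` are hypothetical choices for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def time_scale(gains):
    """Return [t_0, t_1, ..., t_N] with t_0 = 0 and t_n = sum of gains[:n]."""
    return [0.0] + list(accumulate(gains))


def shifted_interpolation(iterates, t, n, s):
    """Value of X^n(s): the iterate X_{n+i} whose interval
    [t_{n+i} - t_n, t_{n+i+1} - t_n) contains s >= 0."""
    if s < 0:
        raise ValueError("s must be nonnegative")
    # Largest index k with t_k <= t_n + s.
    k = bisect_right(t, t[n] + s) - 1
    if k >= len(iterates):
        raise ValueError("s lies beyond the computed iterates")
    return iterates[k]


gains = [1.0 / (k + 1) for k in range(1000)]  # assumed gains eps_k = 1/(k+1)
t = time_scale(gains)
iterates = list(range(1000))  # placeholder X_k = k, to expose the indices
print(shifted_interpolation(iterates, t, n=0, s=0.0))
print(shifted_interpolation(iterates, t, n=0, s=1.2))
print(shifted_interpolation(iterates, t, n=10, s=0.0))
```

Because the gains sum to infinity, $t_n \to \infty$ and every $s \ge 0$ is eventually covered; because the gains tend to zero, the intervals shrink, so $X^n(\cdot)$ views the tail of the sequence on a progressively finer time grid.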

If  $< \infty$, then the idea in [10] can be adapted to get w.p.1 convergence results.




Appendix.  Some  Results  in  Weak  Convergence. 


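For some integer $s$, let $D^s[0,\infty)$ denote the space of $R^s$-valued functions on $[0,\infty)$ which are right continuous and have left hand limits, with the Skorohod topology [7, Chapter 2]. This topology is defined as follows. Let $\Lambda$ be the set of strictly increasing Lipschitz continuous functions from $[0,\infty)$ onto $[0,\infty)$. Define the metric

$$d(x(\cdot), y(\cdot)) = \inf_{\lambda \in \Lambda} \max\left\{ \sup_{s > t \ge 0} \left| \log \frac{\lambda(s) - \lambda(t)}{s - t} \right|,\ \int_0^\infty e^{-T} d_T(x(\cdot), y(\cdot), \lambda)\, dT \right\},$$

where $d_T(x(\cdot), y(\cdot), \lambda) = \min\bigl(1,\ \sup_{t \ge 0} |x(t \wedge T) - y(\lambda(t) \wedge T)|\bigr)$.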

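Define $\{Z^\epsilon_n\}$ and $\{Z^\epsilon(\cdot)\}$ by $Z^\epsilon_{n+1} = Z^\epsilon_n + \epsilon F^\epsilon_n$, $Z^\epsilon(t) = Z^\epsilon_n$ on $[n\epsilon, n\epsilon + \epsilon)$. If $\{Z^\epsilon_0\}$ is tight and the $\{F^\epsilon_n\}$ are uniformly integrable, then $\{Z^\epsilon(\cdot)\}$ is tight in $D^s[0,\infty)$ and all weak limits are absolutely continuous.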

Let $Z^\epsilon(\cdot) \Rightarrow Z(\cdot)$ in $D^s[0,\infty)$. By a suitable choice of the probability space, the weak convergence becomes convergence w.p.1 in the metric of $D^s[0,\infty)$ [13, Theorem 3.1.1]. I.e., there is a probability space $(\tilde\Omega, \tilde{\mathcal B}, \tilde P)$ with processes $\{\tilde Z^\epsilon(\cdot)\}$, $\tilde Z(\cdot)$ defined on it so that for each Borel set $B$ in $D^s[0,\infty)$, $\tilde P\{\tilde Z^\epsilon(\cdot) \in B\} = P\{Z^\epsilon(\cdot) \in B\}$, $\tilde P\{\tilde Z(\cdot) \in B\} = P\{Z(\cdot) \in B\}$ and $\tilde Z^\epsilon(\cdot) \to \tilde Z(\cdot)$ w.p.1 in the topology of $D^s[0,\infty)$. The use of this representation often facilitates the analysis and characterization of the limits.

